GGRaSP: a R-package for selecting representative genomes using Gaussian mixture models.

Thomas H Clarke¹, Lauren M Brinkac¹,², Granger Sutton¹, Derrick E Fouts¹.

Abstract

Motivation: The vast number of available sequenced bacterial genomes occasionally exceeds the capacity of comparative genomic methods or is dominated by a single outbreak strain, and thus a diverse and representative subset is required. Generating such a reduced subset currently requires a priori supervised clustering and sequence-only selection of medoid genomic sequences, independent of any additional genome metrics or strain attributes.
Results: The Gaussian Genome Representative Selector with Prioritization (GGRaSP) R package described below generates a reduced subset of genomes that prioritizes retaining genomes of interest to the user while minimizing the loss of genetic variation. The package also allows for unsupervised clustering by modeling the genomic relationships with a Gaussian mixture model to select an appropriate cluster threshold. We demonstrate the capabilities of GGRaSP by generating a reduced list of 315 genomes from a genomic dataset of 4600 Escherichia coli genomes, prioritizing selection by type strain and by genome completeness.
Availability and implementation: GGRaSP is available at https://github.com/JCVenterInstitute/ggrasp/.
Supplementary information: Supplementary data are available at Bioinformatics online.

Year: 2018 PMID: 29668840 PMCID: PMC6129299 DOI: 10.1093/bioinformatics/bty300

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Next-generation sequencing technologies have resulted in a large number of publicly available microbial genome sequences. The number of genomes available for comparative genomic analysis can exceed what can be feasibly visualized or analyzed (Chan et al., 2015; Chavda et al., 2016; Zaslavsky et al., 2016). Additionally, sequencing of clonal or nearly clonal bacterial pathogens involved in disease outbreaks (e.g. Acinetobacter baumannii, Escherichia coli and Klebsiella pneumoniae) can skew the analyses; therefore, a reduction in genome redundancy to maximize diversity is necessary (Chan et al., 2015). One common method to reduce sequence redundancy while minimizing information loss is to cluster genomes by their nucleotide distance metrics and to select one genome from each cluster as a representative, often a medoid (the genome with the minimal combined distance to the other genomes in the cluster) (Chan et al., 2015; Moreno-Hagelsieb et al., 2013). However, these methods require the user to specify a priori either the number of clusters or a distance cutoff, and they do not allow the user to use the highest quality (i.e. most complete) representative genome for each cluster. Likewise, no dedicated program exists for loading and selecting these genomes.
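
The medoid definition above can be illustrated with a minimal Python sketch. It is not GGRaSP's own code; the function name, genome names and distance values are hypothetical and chosen only for illustration.

```python
def medoid(cluster, dist):
    """Return the member of `cluster` with the smallest summed distance
    to all other members. `dist[a][b]` is a symmetric pairwise distance."""
    return min(cluster, key=lambda g: sum(dist[g][h] for h in cluster if h != g))


# Hypothetical pairwise distances between four genomes (illustrative values).
dist = {
    "gA": {"gA": 0.0, "gB": 0.01, "gC": 0.02, "gD": 0.30},
    "gB": {"gA": 0.01, "gB": 0.0, "gC": 0.01, "gD": 0.31},
    "gC": {"gA": 0.02, "gB": 0.01, "gC": 0.0, "gD": 0.32},
    "gD": {"gA": 0.30, "gB": 0.31, "gC": 0.32, "gD": 0.0},
}

# gB sits between gA and gC, so it has the smallest summed distance.
representative = medoid(["gA", "gB", "gC"], dist)
print(representative)
```

A cluster with a single member returns that member, since its summed distance is zero.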

2 Materials and methods

Here, we introduce GGRaSP (Gaussian Genome Representative Selector with Prioritization), an R package and associated executable Rscript program that generates a list of prioritized representative genomes from either supervised or unsupervised clustering of related genomes. GGRaSP supports three forms of input to describe the relationship between the genomes: (i) a phylogeny in Newick format; (ii) a distance or similarity matrix; or (iii) an aligned multiple FASTA file. GGRaSP uses hierarchical clustering in the hclust R function or the APE R package to create phylogenies from (ii) and (iii) (Paradis et al.). By default, GGRaSP prioritizes medoids as representative genomes in order to minimize the loss of information, but this can result in the removal of genomes that contain regions of interest (e.g. plasmids, antibiotic resistance islands, pathogenicity islands and prophages), have a more complete assembly, or are from a given project. Users can therefore specify which genomes should be preferred as representatives by supplying a text file containing tiered ranks of the genomes. GGRaSP can cluster genomes using supervised methods, including specifying the number of clusters or the cluster cut-off distance, but it also allows for unsupervised clustering by using Gaussian mixture models (GMMs) to identify a cut-off value that separates the most closely related genomes from the more diverse genomes. GMMs of sequence distances have previously been used to model the evolutionary relationship between multiple genomes in metagenomes (e.g. Alneberg et al.; Ji et al., 2017), and to model homologs descending from distinct ancient large-scale duplications in various eukaryotic organisms (e.g. Cui et al., 2006; Schwager et al., 2017).
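
The rank-based prioritization described in this paragraph can be sketched as follows. This is a minimal illustration under stated assumptions, not the package's implementation: it assumes that a lower rank number means higher priority, that unranked genomes sort last, and that ties fall back to the medoid criterion. The function name and all data are hypothetical.

```python
def pick_representative(cluster, dist, rank):
    """Pick the genome with the best (lowest) user-supplied rank.
    Ties, and genomes without a rank, fall back to the smallest
    summed distance to the other cluster members (the medoid)."""
    unranked = float("inf")  # assumption: genomes missing from `rank` sort last

    def key(g):
        spread = sum(dist[g][h] for h in cluster if h != g)
        return (rank.get(g, unranked), spread)

    return min(cluster, key=key)


# Hypothetical distances for one cluster of three genomes.
dist = {
    "gA": {"gA": 0.0, "gB": 0.01, "gC": 0.02},
    "gB": {"gA": 0.01, "gB": 0.0, "gC": 0.01},
    "gC": {"gA": 0.02, "gB": 0.01, "gC": 0.0},
}
cluster = ["gA", "gB", "gC"]

by_medoid = pick_representative(cluster, dist, rank={})        # no ranks: medoid
by_rank = pick_representative(cluster, dist, rank={"gC": 1})   # gC is prioritized
print(by_medoid, by_rank)
```

With no ranks the medoid gB is chosen; once gC is given a rank it replaces gB as the representative, at the cost of a slightly less central choice.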
The GMM could be biased or limited by collections of genomes that contain a single branch of highly related genomes (for which GGRaSP will select a cutoff that clusters only that single branch) or by a set of genomes that is best modeled by a single Gaussian peak (in which case GGRaSP cannot find a cutoff). In GGRaSP, GMMs are calculated using expectation maximization via mixtools or bgmm (Benaglia et al.; Biecek et al.). Multiple Gaussian distributions are tested incrementally until the addition of a distribution is no longer significant by the likelihood ratio test or the number of distributions exceeds the user-defined limit. After the GMM is cleaned by removing overlapping and low-count distributions, the inflection point between the first two distributions is used as the cut-off to generate the clusters (see dotted vertical line, Fig. 1). The default pipeline behavior is described above, but many of the parameters for the GMM-based threshold calculation are user-modifiable for cases where the GMM varies from the default model.
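
The idea of deriving a cut-off from a mixture model can be sketched in plain Python. This is a simplified illustration, not the mixtools or bgmm procedure used by GGRaSP: it fits exactly two Gaussians by expectation maximization (no incremental testing, likelihood ratio test or clean-up step), and it interprets the cut-off as the point between the two means where the weighted densities are equal, which is an assumption about what the inflection point means here. All function names and distance values are hypothetical.

```python
import math


def log_norm(x, mu, sd):
    """Log density of a normal distribution."""
    return -0.5 * ((x - mu) / sd) ** 2 - math.log(sd * math.sqrt(2 * math.pi))


def fit_two_gaussians(xs, iters=200, min_sd=1e-6):
    """Fit a two-component 1D Gaussian mixture by EM.
    Returns [weight, mean, sd] per component, ordered by mean."""
    xs = sorted(xs)
    half = len(xs) // 2
    comps = []
    for part in (xs[:half], xs[half:]):  # initialize from the lower and upper halves
        mu = sum(part) / len(part)
        sd = math.sqrt(sum((x - mu) ** 2 for x in part) / len(part))
        comps.append([len(part) / len(xs), mu, max(sd, min_sd)])
    for _ in range(iters):
        # E-step: responsibilities, computed in log space to avoid underflow
        resp = []
        for x in xs:
            logs = [math.log(w) + log_norm(x, mu, sd) for w, mu, sd in comps]
            top = max(logs)
            ps = [math.exp(v - top) for v in logs]
            total = sum(ps)
            resp.append([p / total for p in ps])
        # M-step: re-estimate weight, mean and sd of each component
        for k in range(2):
            nk = sum(r[k] for r in resp)
            mu = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - mu) ** 2 for r, x in zip(resp, xs)) / nk
            comps[k] = [nk / len(xs), mu, max(math.sqrt(var), min_sd)]
    return sorted(comps, key=lambda c: c[1])


def cutoff_between(c1, c2, steps=100):
    """Bisect between the two means for the point of equal weighted density.
    Assumes component 1 dominates at its own mean and component 2 at its own."""
    (w1, mu1, sd1), (w2, mu2, sd2) = c1, c2
    lo, hi = mu1, mu2
    for _ in range(steps):
        mid = (lo + hi) / 2
        diff = (math.log(w1) + log_norm(mid, mu1, sd1)) - (math.log(w2) + log_norm(mid, mu2, sd2))
        if diff > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Hypothetical pairwise distances: a tight near-clonal group and a diverse group.
distances = [0.001, 0.002, 0.002, 0.003, 0.003, 0.003, 0.004, 0.004, 0.005,
             0.040, 0.045, 0.050, 0.050, 0.055, 0.060]
low, high = fit_two_gaussians(distances)
cutoff = cutoff_between(low, high)
print(low, high, cutoff)
```

In this toy data the cut-off falls in the gap between the two groups of distances; a dataset best described by a single peak would give no meaningful crossing, matching the limitation noted above.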

Fig. 1.

GGRaSP-based reduction of 4600 E. coli genomes. 4600 E. coli genomes were downloaded from NCBI RefSeq (A), clustered using a cut-off (shown as dotted line) determined by GMM (B), and reduced to 315 representative genomes (C). The colored branches in (A) denote branches reduced to a single node in (C).

GGRaSP can output multiple supporting files, as described in detail on the R help pages, including: tab-delimited files with information on the clusters; ggplot2-based images showing the GMM and the initial or final phylogenies (Wickham, 2009), with colorspace used to determine the hues of GMM and phylogeny shading (Ihaka et al.); the Newick files for any phylogeny used in GGRaSP; and the iTOL-formatted text files showing the clusters on the phylogenies (Letunic and Bork, 2016). An Rscript version of GGRaSP that runs on the command line to facilitate high-throughput analyses is also provided.

3 Usage scenario

To demonstrate the usefulness of GGRaSP, we downloaded 4600 Escherichia genomes from NCBI RefSeq on 2/2/2017 using the downloader script in the LOCUST package (Brinkac et al., 2017). A whole genome-based Average Nucleotide Identity (gANI) matrix was calculated with Mash (Ondov et al., 2016). The genomes were ranked, in order, by whether each was a type strain, whether it was circular, and whether it was complete. The remaining genomes were ranked by the number of contigs and genes according to the LOCUST downloader output. The similarity matrix and the ranking file were input to GGRaSP, which computed a cut-off of 1.09% identity after modeling 9 Gaussian distributions (10 before clean-up), leading to a selection of 315 representative genomes in 98 min and 2 s (Fig. 1, Supplementary Fig. S1). When using the a priori cutoff of 96.5% gANI suggested for species (Varghese et al., 2015), only nine clusters were generated, with the largest cluster containing 98.9% of the genomes. Ranking the genomes as described above improved the retained genomes relative to selecting representatives from an unranked set, raising the percentage of complete genomes from 6.7 to 25.4% and the mean N50 from 205 to 556 kb. All input and output files for these runs and the a priori cutoffs are available on the GitHub repository.
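
The tiered ranking used in this scenario can be sketched as a sort key. This is one possible reading of the ordering, not the script used for the analysis: it assumes that type strains form the first tier, followed by circular and then complete genomes, and that within a tier fewer contigs and then more genes are preferred. The field names and all values are hypothetical, and the ranking file format expected by GGRaSP is not reproduced here.

```python
def tier(genome):
    """Assign a priority tier: lower is better."""
    if genome["type_strain"]:
        return 0
    if genome["circular"]:
        return 1
    if genome["complete"]:
        return 2
    return 3  # remaining genomes, ordered only by contigs and genes


def rank_genomes(genomes):
    """Return genome names from highest to lowest priority.
    Within a tier: fewer contigs first, then more genes (an assumption)."""
    ordered = sorted(genomes, key=lambda g: (tier(g), g["contigs"], -g["genes"]))
    return [g["name"] for g in ordered]


# Hypothetical genome metadata (illustrative values only).
genomes = [
    {"name": "g1", "type_strain": False, "circular": False, "complete": False, "contigs": 120, "genes": 4800},
    {"name": "g2", "type_strain": True, "circular": False, "complete": False, "contigs": 85, "genes": 4700},
    {"name": "g3", "type_strain": False, "circular": True, "complete": True, "contigs": 1, "genes": 4500},
    {"name": "g4", "type_strain": False, "circular": False, "complete": False, "contigs": 40, "genes": 4900},
    {"name": "g5", "type_strain": False, "circular": False, "complete": True, "contigs": 3, "genes": 4600},
]

ranking = rank_genomes(genomes)
print(ranking)
```

Here the type strain g2 ranks first despite its fragmented assembly, which reflects the stated priority of strain attributes over assembly quality.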

4 Conclusion

As the number of sequenced genomes available for comparative genomic analysis continues to expand, the need to generate robust representative genomic subsets will increase. Building on the statistical, bioinformatic, and graphical capabilities of R, GGRaSP and the accompanying Rscript provide a single, customizable platform for running multiple analyses to generate a subset of representative genomes. The user can specify clustering parameters and levels of importance for ranking the genomes, thus allowing for both generalizable high-throughput use and more dataset-specific use.

13 in total

1. Microbial species delineation using whole genome sequences.

Authors: Neha J Varghese; Supratim Mukherjee; Natalia Ivanova; Konstantinos T Konstantinidis; Kostas Mavrommatis; Nikos C Kyrpides; Amrita Pati
Journal: Nucleic Acids Res Date: 2015-07-06 Impact factor: 16.971

2. Phylogenomic clustering for selecting non-redundant genomes for comparative genomics.

Authors: Gabriel Moreno-Hagelsieb; Zilin Wang; Stephanie Walsh; Aisha ElSherbiny
Journal: Bioinformatics Date: 2013-02-08 Impact factor: 6.937

3. Widespread genome duplications throughout the history of flowering plants.

Authors: Liying Cui; P Kerr Wall; James H Leebens-Mack; Bruce G Lindsay; Douglas E Soltis; Jeff J Doyle; Pamela S Soltis; John E Carlson; Kathiravetpilla Arumuganathan; Abdelali Barakat; Victor A Albert; Hong Ma; Claude W dePamphilis
Journal: Genome Res Date: 2006-05-15 Impact factor: 9.043

4. LOCUST: a custom sequence locus typer for classifying microbial isolates.

Authors: Lauren M Brinkac; Erin Beck; Jason Inman; Pratap Venepally; Derrick E Fouts; Granger Sutton
Journal: Bioinformatics Date: 2017-06-01 Impact factor: 6.937

5. A novel method of consensus pan-chromosome assembly and large-scale comparative analysis reveal the highly flexible pan-genome of Acinetobacter baumannii.

Authors: Agnes P Chan; Granger Sutton; Jessica DePew; Radha Krishnakumar; Yongwook Choi; Xiao-Zhe Huang; Erin Beck; Derek M Harkins; Maria Kim; Emil P Lesho; Mikeljon P Nikolich; Derrick E Fouts
Journal: Genome Biol Date: 2015-07-21 Impact factor: 13.583

6. Mash: fast genome and metagenome distance estimation using MinHash.

Authors: Brian D Ondov; Todd J Treangen; Páll Melsted; Adam B Mallonee; Nicholas H Bergman; Sergey Koren; Adam M Phillippy
Journal: Genome Biol Date: 2016-06-20 Impact factor: 13.583

7. MetaSort untangles metagenome assembly by reducing microbial community complexity.

Authors: Peifeng Ji; Yanming Zhang; Jinfeng Wang; Fangqing Zhao
Journal: Nat Commun Date: 2017-01-23 Impact factor: 14.919

8. The house spider genome reveals an ancient whole-genome duplication during arachnid evolution.

Authors: Evelyn E Schwager; Prashant P Sharma; Thomas Clarke; Daniel J Leite; Torsten Wierschin; Matthias Pechmann; Yasuko Akiyama-Oda; Lauren Esposito; Jesper Bechsgaard; Trine Bilde; Alexandra D Buffry; Hsu Chao; Huyen Dinh; HarshaVardhan Doddapaneni; Shannon Dugan; Cornelius Eibner; Cassandra G Extavour; Peter Funch; Jessica Garb; Luis B Gonzalez; Vanessa L Gonzalez; Sam Griffiths-Jones; Yi Han; Cheryl Hayashi; Maarten Hilbrant; Daniel S T Hughes; Ralf Janssen; Sandra L Lee; Ignacio Maeso; Shwetha C Murali; Donna M Muzny; Rodrigo Nunes da Fonseca; Christian L B Paese; Jiaxin Qu; Matthew Ronshaugen; Christoph Schomburg; Anna Schönauer; Angelika Stollewerk; Montserrat Torres-Oliva; Natascha Turetzek; Bram Vanthournout; John H Werren; Carsten Wolff; Kim C Worley; Gregor Bucher; Richard A Gibbs; Jonathan Coddington; Hiroki Oda; Mario Stanke; Nadia A Ayoub; Nikola-Michael Prpic; Jean-François Flot; Nico Posnien; Stephen Richards; Alistair P McGregor
Journal: BMC Biol Date: 2017-07-31 Impact factor: 7.431

9. Clustering analysis of proteins from microbial genomes at multiple levels of resolution.

Authors: Leonid Zaslavsky; Stacy Ciufo; Boris Fedorov; Tatiana Tatusova
Journal: BMC Bioinformatics Date: 2016-08-31 Impact factor: 3.169

10. Comprehensive Genome Analysis of Carbapenemase-Producing Enterobacter spp.: New Insights into Phylogeny, Population Structure, and Resistance Mechanisms.

Authors: Kalyan D Chavda; Liang Chen; Derrick E Fouts; Granger Sutton; Lauren Brinkac; Stephen G Jenkins; Robert A Bonomo; Mark D Adams; Barry N Kreiswirth
Journal: MBio Date: 2016-12-13 Impact factor: 7.867
