Roary: rapid large-scale prokaryote pan genome analysis.

Andrew J Page¹, Carla A Cummins¹, Martin Hunt¹, Vanessa K Wong², Sandra Reuter³, Matthew T G Holden⁴, Maria Fookes¹, Daniel Falush⁵, Jacqueline A Keane¹, Julian Parkhill¹.

Abstract

A typical prokaryote population sequencing study can now consist of hundreds or thousands of isolates. Interrogating these datasets can provide detailed insights into the genetic structure of prokaryotic genomes. We introduce Roary, a tool that rapidly builds large-scale pan genomes, identifying the core and accessory genes. Roary makes construction of the pan genome of thousands of prokaryote samples possible on a standard desktop without compromising on the accuracy of results. Using a single CPU Roary can produce a pan genome consisting of 1000 isolates in 4.5 hours using 13 GB of RAM, with further speedups possible using multiple processors.

AVAILABILITY AND IMPLEMENTATION: Roary is implemented in Perl and is freely available under an open source GPLv3 license from http://sanger-pathogens.github.io/Roary

CONTACT: roary@sanger.ac.uk

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2015 PMID: 26198102 PMCID: PMC4817141 DOI: 10.1093/bioinformatics/btv421

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

The term microbial pan genome was first used in 2005 (Medini ) to describe the union of genes shared by genomes of interest (Vernikos ). Since then, availability of microbial sequencing data has grown exponentially. Aligning whole-genome-sequenced isolates to a single reference genome can fail to incorporate non-reference sequences. By using de novo assemblies, non-reference sequences can also be analyzed. Microbial organisms can rapidly acquire genes from other organisms that can increase virulence or promote antimicrobial drug resistance (Medini ). Gaining a better picture of the conserved genes of an organism, and the accessory genome, can lead to a better understanding of key processes such as selection and evolution.

The construction of a pan genome is NP-hard (Nguyen ) with additional difficulties from real data due to contamination, fragmented assemblies and poor annotation. Therefore, any approach must employ heuristics to produce a pan genome (reviewed in Vernikos ). The most complete standalone pan genome tools are PanOCT (Fouts ), which uses a conserved gene neighborhood in addition to homology to accurately place proteins into orthologous clusters; LS-BSR (Sahl ), which uses a pre-clustering step before running BLAST to rapidly assign genes to families; and PGAP, which takes annotated assemblies, performs an all-against-all BLAST, clusters the results and produces a pan genome (Zhao ). PanOCT and PGAP require an all-against-all comparison using BLAST, so their running time grows approximately quadratically with the size of the input data, and both are computationally infeasible with large datasets. They also have quadratic memory requirements, quickly exceeding the RAM available in high performance servers for large datasets. LS-BSR introduces a pre-clustering step that makes it an order of magnitude faster than PGAP; however, it is less sensitive (Sahl ).

We have developed a method to generate the pan genome of a set of related prokaryotic isolates. It works with thousands of isolates in a computationally feasible time, beginning with annotated fragmented de novo assemblies. We address the computational issues by performing a rapid clustering of highly similar sequences, which can reduce the running time of BLAST substantially, and carefully manage RAM usage so that it increases linearly, both of which make it possible to analyze datasets with thousands of samples using commonly available computing hardware.
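
The saving from pre-clustering can be illustrated with a small counting sketch. The Python below is not Roary's implementation: it uses a hypothetical greedy clustering over toy strings, with a naive position-wise identity function standing in for a real aligner, and simply counts how many pairwise comparisons an all-against-all step would need before and after the reduction.

```python
def identity(a: str, b: str) -> float:
    """Naive position-wise identity for equal-length toy sequences.

    Assumption: a stand-in for a real alignment-based identity score.
    """
    if len(a) != len(b):
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)


def greedy_precluster(seqs, threshold):
    """Assign each sequence to the first representative it matches at or
    above `threshold`; otherwise it becomes a new representative."""
    clusters = {}  # representative -> list of members
    for s in seqs:
        for rep in clusters:
            if identity(s, rep) >= threshold:
                clusters[rep].append(s)
                break
        else:
            clusters[s] = [s]
    return clusters


def n_pairs(n: int) -> int:
    """Number of unordered pairs in an all-against-all comparison."""
    return n * (n - 1) // 2


# Hypothetical toy proteins, not real data.
proteins = [
    "MKTAYIAKQR", "MKTAYIAKQR", "MKTAYIAKQK",  # near-identical copies of one gene
    "MSLLNVPAGK", "MSLLNVPAGK",                # a second gene
    "MADEEKLPPG",                              # a singleton
]
clusters = greedy_precluster(proteins, threshold=0.9)
before = n_pairs(len(proteins))
after = n_pairs(len(clusters))
print(f"{len(proteins)} sequences -> {len(clusters)} representatives")
print(f"pairwise comparisons: {before} -> {after}")
```

Because the comparison count grows with the square of the number of sequences, collapsing near-identical sequences before the expensive step shrinks the work much faster than it shrinks the input.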

2 Description

The input to Roary is one annotated assembly per sample in GFF3 format (Stein, 2013), such as that produced by Prokka (Seemann, 2014), where all samples are from the same species. Coding regions are extracted from the input and converted to protein sequences, filtered to remove partial sequences and iteratively pre-clustered with CD-HIT (Fu ). This results in a substantially reduced set of protein sequences. An all-against-all comparison is performed with BLASTP on the reduced sequences with a user defined percentage sequence identity (default 95%). Sequences are then clustered with MCL (Enright ), and finally, the pre-clustering results from CD-HIT are merged together with the results of MCL. Using conserved gene neighborhood information, homologous groups containing paralogs are split into groups of true orthologs. A graph is constructed of the relationships of the clusters based on the order of occurrence in the input sequences, allowing for the clusters to be ordered and thus providing context for each gene. Isolates are clustered based on gene presence in the accessory genome, with the contribution of isolates to the graph weighted by cluster size. A suite of command line tools is provided to interrogate the dataset providing union, intersection and complement. Full details of the method and outputs are provided in the Supplementary Material.
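
The union, intersection and complement queries mentioned above are ordinary set operations on the gene clusters found in each isolate. The sketch below illustrates the idea only; the function names, isolate names and gene labels are hypothetical and do not reflect the interface of Roary's command line tools.

```python
def union(groups, isolates):
    """Gene clusters present in at least one of the given isolates."""
    return set().union(*(groups[i] for i in isolates))


def intersection(groups, isolates):
    """Gene clusters present in every one of the given isolates."""
    sets = [groups[i] for i in isolates]
    return set.intersection(*sets) if sets else set()


def complement(groups, first, second):
    """Gene clusters found in the `first` isolates but in none of `second`."""
    return union(groups, first) - union(groups, second)


# Hypothetical toy data: isolate name -> set of gene cluster labels.
presence = {
    "isolate_A": {"gyrA", "recA", "blaTEM"},
    "isolate_B": {"gyrA", "recA", "tetA"},
    "isolate_C": {"gyrA", "recA"},
}

all_isolates = list(presence)
print(sorted(union(presence, all_isolates)))
print(sorted(intersection(presence, all_isolates)))
print(sorted(complement(presence, ["isolate_A"], ["isolate_B", "isolate_C"])))
```

In this toy example the complement query isolates the one cluster unique to the first isolate, which is the kind of question such tools are typically used to answer when comparing groups of samples.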

3 Results

We evaluated the accuracy, running time and memory usage of Roary against three similar standalone pan genome applications. In each case, we performed the analysis using a single processor (AMD Opteron 6272) and provided 60 GB of RAM. We constructed a simulated dataset based on Salmonella enterica serovar Typhi (S.typhi) CT18 (acc. no. AL513382), allowing us to accurately assess the quality of the clustering. We created 12 genomes with 994 identical core genes and 23 accessory genes in varying combinations. All the applications created clusters that are within 1% of the expected results, with Roary correctly building all genes as shown in Table 1. The overlap of the clusters is virtually identical in all applications except LS-BSR, which over-clusters in 2% of cases.

Table 1.

Accuracy of each pan genome application on a dataset of simulated data

	Core genes	Total genes	Incorrect split	Incorrect merge
Expected	994	1017	0	0
PGAP	991	1012	0	4
PanOCT	993	1015	1	1
LS-BSR	974	994	0	23
Roary	994	1017	0	0
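
As a rough check on Table 1, the counts of incorrect splits and merges can be expressed as a percentage of the 1017 expected genes. This metric is an assumption made here for illustration, not one defined in the paper; the numbers are copied from the table.

```python
EXPECTED_TOTAL = 1017  # expected total genes in the simulated dataset (Table 1)

# application -> (incorrect splits, incorrect merges), copied from Table 1
table1 = {
    "PGAP": (0, 4),
    "PanOCT": (1, 1),
    "LS-BSR": (0, 23),
    "Roary": (0, 0),
}

# Assumed metric: mis-clustered genes as a percentage of the expected total.
error_pct = {
    app: 100 * (splits + merges) / EXPECTED_TOTAL
    for app, (splits, merges) in table1.items()
}
for app, pct in error_pct.items():
    print(f"{app:7s} {pct:.2f}% of expected genes mis-split or mis-merged")
```

Under this metric LS-BSR comes out at a little over 2%, in line with the over-clustering rate quoted in the text, while the other applications stay well below 1%.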

In addition, a set of 1000 real annotated assemblies of S.typhi genomes was used. Subsets of the data were provided to each application, and the running time and memory usage were noted. The running time of PGAP and PanOCT increases substantially, making only small datasets computationally feasible (Fig. 1 and Supplementary Figs S1–S8). Roary scales consistently as more samples are added (Supplementary Figs S1–S8) and has been shown to work on a dataset of 1000 isolates as shown in Table 2. The memory usage of PGAP and PanOCT also increases rapidly as more samples are added, quickly exceeding 60 GB for even small datasets. The memory usage of Roary scales consistently as more samples are added, making it feasible to process large datasets on a standard desktop computer within a few hours. We conducted similar experiments with more diverse datasets including Streptococcus pneumoniae, Staphylococcus aureus and Yersinia enterocolitica and the results exhibit similar speed-ups as shown in Supplementary Figures S7 and S8. The performance in a multi-processor environment is shown in Supplementary Figs S11 and S12, with Roary achieving a speedup of 3.7X using 8 CPUs and GNU Parallel (Tange, 2011).
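
The reported multi-processor figure can be put in context with a short calculation. The sketch below computes the parallel efficiency implied by a 3.7X speedup on 8 CPUs and, assuming Amdahl's law applies (a modelling assumption made here, not a claim from the paper), the fraction of the single-CPU running time that would have to be parallelizable to give that speedup.

```python
def parallel_efficiency(speedup: float, n_cpus: int) -> float:
    """Speedup achieved per CPU used."""
    return speedup / n_cpus


def amdahl_parallel_fraction(speedup: float, n_cpus: int) -> float:
    """Invert Amdahl's law, S = 1 / ((1 - p) + p / n), to solve for p."""
    return (1 - 1 / speedup) / (1 - 1 / n_cpus)


speedup, n_cpus = 3.7, 8  # values reported in the text
eff = parallel_efficiency(speedup, n_cpus)
p = amdahl_parallel_fraction(speedup, n_cpus)
print(f"parallel efficiency: {eff:.0%}")
print(f"implied parallel fraction under Amdahl's law: {p:.2f}")
```

Under this simple model, roughly five sixths of the work would be parallelizable, so adding processors beyond 8 would be expected to give diminishing returns.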

Fig. 1.

Effect of dataset size on the wall time of multiple applications. Only analysis that completed within 2 days and 60 GB of RAM is shown

Table 2.

Comparison of pan genome applications using real S.typhi data (ERP001718)

Samples	Software	Core^a	Total	RAM (MB)	Wall time (s)
8	PGAP	4545	4929	569	41 397
	PanOCT	4544	4936	663	1457
	LS-BSR	4476	4816	270	2585
	Roary	4459	4871	156	44
24	PGAP	—	—	—	—
	PanOCT	4522	4991	5313	96 093
	LS-BSR	4451	4843	554	7807
	Roary	4436	4941	444	382
1000	PGAP	—	—	—	—
	PanOCT	—	—	—	—
	LS-BSR	4272	7265	17 413	345 019
	Roary	4016	9201	13 752	15 465

^a Core is defined as a gene being in at least 99% of samples, which allows for some assembly errors in very large datasets. Where there are no results, the applications failed to complete within 5 days or used more than 60 GB of RAM. The first column is the number of unique S.typhi genomes in the input set with a mean of 54 contigs over all 1000 assemblies.
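
The core definition used in Table 2 can be written down directly. The following sketch assumes a simple mapping from gene cluster to the set of samples carrying it; the function name, data layout and toy genes are hypothetical, and only the 99% threshold comes from the footnote above.

```python
def split_core_accessory(presence, n_samples, core_percent=99):
    """Split gene clusters into core and accessory sets.

    presence: gene cluster -> set of sample ids that carry it.
    A gene is core if it occurs in at least `core_percent` % of samples.
    Integer arithmetic avoids floating-point edge cases at the threshold.
    """
    core, accessory = set(), set()
    for gene, samples in presence.items():
        if len(samples) * 100 >= core_percent * n_samples:
            core.add(gene)
        else:
            accessory.add(gene)
    return core, accessory


# Hypothetical toy data for 200 samples.
n_samples = 200
presence = {
    "geneA": set(range(200)),  # present in every sample
    "geneB": set(range(198)),  # 99.0% of samples: still core
    "geneC": set(range(197)),  # 98.5% of samples: accessory
    "geneD": set(range(5)),    # rare accessory gene
}
core, accessory = split_core_accessory(presence, n_samples)
print("core:", sorted(core))
print("accessory:", sorted(accessory))
```

Allowing a gene to be missing from up to 1% of samples means that a handful of assembly or annotation failures in a large collection does not remove an otherwise universal gene from the core set.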

4 Discussion

We have shown that Roary can construct the pan genomes of large collections of bacterial genomes using a desktop computer, where it was not previously computationally possible with other methods. Further speedups in running time are possible by providing more processors to Roary. On simulated data, Roary is the only application to correctly identify all clusters. This increased accuracy comes from using the context provided by conserved gene neighborhood information. Roary scales well on large real datasets, identifying large numbers of core genes, even in the presence of a varied open pan genome.
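
The scaling claim can be checked against the wall times in Table 2 with a little arithmetic. The sketch below copies those wall times (in seconds) and computes how many times faster Roary finished than each other application at the same dataset size; combinations with no result in the table are omitted.

```python
# (samples, software) -> wall time in seconds, copied from Table 2
wall_time_s = {
    (8, "PGAP"): 41397, (8, "PanOCT"): 1457, (8, "LS-BSR"): 2585, (8, "Roary"): 44,
    (24, "PanOCT"): 96093, (24, "LS-BSR"): 7807, (24, "Roary"): 382,
    (1000, "LS-BSR"): 345019, (1000, "Roary"): 15465,
}


def speedup_over(samples: int, other: str) -> float:
    """How many times faster Roary finished than `other` at a given size."""
    return wall_time_s[(samples, other)] / wall_time_s[(samples, "Roary")]


for (samples, software) in wall_time_s:
    if software != "Roary":
        ratio = speedup_over(samples, software)
        print(f"{samples:>4} samples: Roary {ratio:.1f}x faster than {software}")

roary_hours = wall_time_s[(1000, "Roary")] / 3600
print(f"1000 samples with Roary: {roary_hours:.1f} h")
```

For the 1000-sample run this works out to roughly 4.3 hours and about a 22-fold speedup over LS-BSR, the only other application that completed at that size.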

References

1. An efficient algorithm for large-scale detection of protein families.

Authors: A J Enright; S Van Dongen; C A Ouzounis
Journal: Nucleic Acids Res Date: 2002-04-01 Impact factor: 16.971

2. The microbial pan-genome.

Authors: Duccio Medini; Claudio Donati; Hervé Tettelin; Vega Masignani; Rino Rappuoli
Journal: Curr Opin Genet Dev Date: 2005-09-26 Impact factor: 5.578

3. Ten years of pan-genome analyses.

Authors: George Vernikos; Duccio Medini; David R Riley; Hervé Tettelin
Journal: Curr Opin Microbiol Date: 2014-12-05 Impact factor: 7.934

4. Prokka: rapid prokaryotic genome annotation.

Authors: Torsten Seemann
Journal: Bioinformatics Date: 2014-03-18 Impact factor: 6.937

5. PanOCT: automated clustering of orthologs using conserved gene neighborhood for pan-genomic analysis of bacterial strains and closely related species.

Authors: Derrick E Fouts; Lauren Brinkac; Erin Beck; Jason Inman; Granger Sutton
Journal: Nucleic Acids Res Date: 2012-08-16 Impact factor: 16.971

6. PGAP: pan-genomes analysis pipeline.

Authors: Yongbing Zhao; Jiayan Wu; Junhui Yang; Shixiang Sun; Jingfa Xiao; Jun Yu
Journal: Bioinformatics Date: 2011-11-29 Impact factor: 6.937

7. CD-HIT: accelerated for clustering the next-generation sequencing data.

Authors: Limin Fu; Beifang Niu; Zhengwei Zhu; Sitao Wu; Weizhong Li
Journal: Bioinformatics Date: 2012-10-11 Impact factor: 6.937

8. The large-scale blast score ratio (LS-BSR) pipeline: a method to rapidly compare genetic content between bacterial genomes.

Authors: Jason W Sahl; J Gregory Caporaso; David A Rasko; Paul Keim
Journal: PeerJ Date: 2014-04-01 Impact factor: 2.984
