Andrew J Page, Carla A Cummins, Martin Hunt, Vanessa K Wong, Sandra Reuter, Matthew T G Holden, Maria Fookes, Daniel Falush, Jacqueline A Keane, Julian Parkhill.
Abstract
A typical prokaryote population sequencing study can now consist of hundreds or thousands of isolates. Interrogating these datasets can provide detailed insights into the genetic structure of prokaryotic genomes. We introduce Roary, a tool that rapidly builds large-scale pan genomes, identifying the core and accessory genes. Roary makes construction of the pan genome of thousands of prokaryote samples possible on a standard desktop without compromising on the accuracy of results. Using a single CPU Roary can produce a pan genome consisting of 1000 isolates in 4.5 hours using 13 GB of RAM, with further speedups possible using multiple processors.
Year: 2015 PMID: 26198102 PMCID: PMC4817141 DOI: 10.1093/bioinformatics/btv421
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Accuracy of each pan genome application on a dataset of simulated data
| | Core genes | Total genes | Incorrect split | Incorrect merge |
|---|---|---|---|---|
| Expected | 994 | 1017 | 0 | 0 |
| PGAP | 991 | 1012 | 0 | 4 |
| PanOCT | 993 | 1015 | 1 | 1 |
| LS-BSR | 974 | 994 | 0 | 23 |
| Roary | 994 | 1017 | 0 | 0 |
Fig. 1. Effect of dataset size on the wall time of multiple applications. Only analyses that completed within 2 days and 60 GB of RAM are shown.
Comparison of pan genome applications using real S. typhi data (ERP001718)
| Samples | Software | Core | Total | RAM (MB) | Wall time (s) |
|---|---|---|---|---|---|
| 8 | PGAP | 4545 | 4929 | 569 | 41 397 |
| 8 | PanOCT | 4544 | 4936 | 663 | 1457 |
| 8 | LS-BSR | 4476 | 4816 | 270 | 2585 |
| 8 | Roary | 4459 | 4871 | 156 | 44 |
| 24 | PGAP | — | — | — | — |
| 24 | PanOCT | 4522 | 4991 | 5313 | 96 093 |
| 24 | LS-BSR | 4451 | 4843 | 554 | 7807 |
| 24 | Roary | 4436 | 4941 | 444 | 382 |
| 1000 | PGAP | — | — | — | — |
| 1000 | PanOCT | — | — | — | — |
| 1000 | LS-BSR | 4272 | 7265 | 17 413 | 345 019 |
| 1000 | Roary | 4016 | 9201 | 13 752 | 15 465 |
Core is defined as a gene present in at least 99% of samples, which allows for some assembly errors in very large datasets. Where no results are shown, the application failed to complete within 5 days or used more than 60 GB of RAM. The first column is the number of unique S. typhi genomes in the input set, which have a mean of 54 contigs over all 1000 assemblies.
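The 99% core threshold described in the table note can be illustrated with a minimal Python sketch. This is not Roary's implementation: the function name `core_genes`, the `presence` mapping (gene name to the set of samples containing it), and the toy data are all assumptions made up for illustration.

```python
from typing import Dict, Set


def core_genes(
    presence: Dict[str, Set[str]], n_samples: int, min_percent: int = 99
) -> Set[str]:
    """Return the genes found in at least `min_percent` % of samples.

    `presence` is a hypothetical gene -> samples mapping, not a Roary output format.
    """
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    # Integer comparison avoids floating-point edge cases at the threshold.
    return {
        gene
        for gene, samples in presence.items()
        if len(samples) * 100 >= min_percent * n_samples
    }


# Toy input (invented for illustration): 200 samples named s0..s199.
samples = [f"s{i}" for i in range(200)]
presence = {
    "geneA": set(samples),        # 200/200 = 100%  -> core
    "geneB": set(samples[:198]),  # 198/200 = 99%   -> core
    "geneC": set(samples[:197]),  # 197/200 = 98.5% -> accessory
}

core = core_genes(presence, n_samples=len(samples))
accessory = set(presence) - core
print(sorted(core), sorted(accessory))
```

With a strict 100% threshold (`min_percent=100`), a gene missing from a single poorly assembled sample would drop out of the core, which is the situation the 99% definition is meant to tolerate in very large datasets.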