Andrew J Page, Nabil-Fareed Alikhan, Heather A Carleton, Torsten Seemann, Jacqueline A Keane, Lee S Katz.
Abstract
Multi-locus sequence typing (MLST) is a widely used method for categorizing bacteria. Increasingly, MLST is being performed using next-generation sequencing (NGS) data by reference laboratories and for clinical diagnostics. Many software applications have been developed to calculate sequence types from NGS data; however, there has been no comprehensive review of these methods to date. We have compared eight of these applications against real and simulated data, and present results on: (1) the accuracy of each method against traditional typing methods, (2) the performance on real outbreak datasets, (3) the impact of contamination and varying depth of coverage, and (4) the computational resource requirements.
Keywords: MLST; multi-locus sequence typing; next-generation sequencing; software comparison
Year: 2017 PMID: 29026660 PMCID: PMC5610716 DOI: 10.1099/mgen.0.000124
Source DB: PubMed Journal: Microb Genom ISSN: 2057-5858
Overview of MLST software
| | Reads | Assembly | GPL3 | GitHub | Yes | Pip, Apt, Docker | Command line |
| BigsDB | Contigs | | GPL3 | GitHub | No | Manual | Website |
| BioNumerics | Reads/contigs | Proprietary/ | Bespoke | Proprietary | | Manual | GUI |
| EnteroBase | Reads | | | | | | Website |
| | Reads | Mapping | FreeBSD | GitHub | No | Manual | Command line |
| mlst* | Contigs | | GPL2 | GitHub | No | Brew | Command line |
| | Contigs | | Apache 2 | Bitbucket | No | Docker | Command line/Website |
| MLSTcheck | Contigs | | GPL3 | GitHub | Yes | CPAN, Docker | Command line |
| SeqSphere+ | Contigs | | Bespoke | Proprietary | | Manual | GUI |
| SRST2 | Reads | Mapping | BSD | GitHub | Yes | Apt, pip | Command line |
| stringMLST | Reads | | Bespoke | GitHub | No | Manual | Command line |
*https://github.com/tseemann/mlst
Overview of the MLST databases available with each software application.
| | Yes | 0 | – | Yes |
| BioNumerics | Yes | 0 | – | Yes |
| | Yes | 125 | 1 month | Yes |
| MLSTcheck | Yes | 0 | – | Yes |
| | No | 6 | >1 year | Yes |
| SeqSphere+ | Yes | 0 | – | Yes |
| | Yes | 0 | – | Yes |
| stringMLST | Yes | 128 | 1 month | Yes |
DB, Database.
*The age of the bundled databases was calculated on 15 March 2017.
Summary of performance of each algorithm on real outbreak data for four different species (85 samples)
| Software | Total time (min) | Correct ST (%) | No call/low confidence (%) |
|---|---|---|---|
| | 109.5 | 98.8 | 1.2 |
| BioNumerics | | | |
| mlst* | 1.9 (+2873) | 96.5 | 3.5 |
| MOST† | 1189.7 | 49.4 | 50.6 |
| MLSTcheck* | 63.8 (+2873) | | |
| SeqSphere+ | | 96.5 | 3.5 |
| | 2380.2 | 95.3 | 4.7 |
| stringMLST | | | |
Values in bold indicate the best results in each column.
*The time to assemble with SPAdes before running the applications was 2873 min and is included separately.
†MOST identified the correct ST in 97.6 % of cases, but flagged 48.2 % of these calls as low confidence.
Fig. 1. Number of correct calls by each application as coverage increases. Each ST consists of seven alleles, and all seven must be correctly and confidently called to calculate an ST.
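The all-or-nothing rule in the caption above can be illustrated with a minimal Python sketch. It is not taken from any of the compared applications: the locus names, the `AlleleCall` type, the `call_st` function and the toy profile table are all hypothetical, and the allele and ST numbers are made up for illustration.

```python
from typing import Dict, NamedTuple, Optional, Tuple

# Hypothetical placeholder names for the seven loci of a scheme.
LOCI = ("locus1", "locus2", "locus3", "locus4", "locus5", "locus6", "locus7")


class AlleleCall(NamedTuple):
    """One allele call for one locus (hypothetical structure)."""
    allele: Optional[int]  # allele number, or None if nothing was called
    confident: bool        # False if the caller flagged the allele as uncertain


def call_st(calls: Dict[str, AlleleCall],
            profiles: Dict[Tuple[int, ...], int]) -> Optional[int]:
    """Return the ST only if all seven alleles are called confidently.

    `profiles` maps a tuple of seven allele numbers (in LOCI order) to an ST.
    Returns None if any locus is missing or low confidence, or if the
    allele combination is not present in the profile table.
    """
    alleles = []
    for locus in LOCI:
        call = calls.get(locus)
        if call is None or call.allele is None or not call.confident:
            return None
        alleles.append(call.allele)
    return profiles.get(tuple(alleles))


# Toy profile table; the numbers are invented, not real ST definitions.
toy_profiles = {
    (1, 1, 1, 1, 1, 1, 1): 1,
    (1, 2, 1, 1, 1, 1, 1): 2,
}
```

Under these assumptions, a sample with six confident alleles and one uncertain allele yields no ST at all, which is why a single weak locus at low coverage removes the whole call.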
Fig. 2. Running time (s) of each application as the depth of coverage increases. No contigs could be assembled where the coverage was less than 7×, so no data were recorded for the methods that rely on assembly (mlst and MLSTcheck). No performance results are available for BioNumerics or SeqSphere+.
Fig. 3. Disk space requirements in bytes for each software application as the depth of coverage increases. Due to the large difference between applications, a log scale is used.
Fig. 4. STs called by each software application when given data containing two different Salmonella samples in varying ratios of abundance. Where there is no ST called, or where the ST has any ambiguity at all, it is marked as low confidence. A false positive is where an ST is called with high confidence and is not one of the two samples in the raw data.
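The labelling rules stated in the Fig. 4 caption can be written as a short Python sketch. The function name `classify_call`, its parameters and the label "expected ST" for the remaining case are assumptions made for illustration; the caption itself only defines "low confidence" and "false positive".

```python
from typing import Optional, Set


def classify_call(st: Optional[int], ambiguous: bool, expected: Set[int]) -> str:
    """Label one ST call from a mixed sample (hypothetical helper).

    st        -- the ST reported by the application, or None if no ST was called
    ambiguous -- True if the application reported any ambiguity for the call
    expected  -- the STs of the two samples mixed in the raw data
    """
    # No call, or any ambiguity at all, counts as low confidence.
    if st is None or ambiguous:
        return "low confidence"
    # A confident call that matches neither mixed sample is a false positive.
    if st not in expected:
        return "false positive"
    # Assumed label for a confident call matching one of the two samples.
    return "expected ST"
```

Note that ambiguity is checked first, so an ambiguous call is labelled low confidence even when the reported ST is wrong; only confident calls can be false positives.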