Pedro Feijao, Hua-Ting Yao, Dan Fornika, Jennifer Gardy, William Hsiao, Cedric Chauve, Leonid Chindelevitch.
Abstract
MLST (multi-locus sequence typing) is a classic technique for genotyping bacteria, widely applied for pathogen outbreak surveillance. Traditionally, MLST is based on identifying sequence types from a small number of housekeeping genes. With the increasing availability of whole-genome sequencing data, MLST methods have evolved towards larger typing schemes, based on a few hundred genes [core genome MLST (cgMLST)] to a few thousand genes [whole genome MLST (wgMLST)]. Such large-scale MLST schemes have been shown to provide a finer resolution and are increasingly used in various contexts such as hospital outbreaks or foodborne pathogen outbreaks. This methodological shift raises new computational challenges, especially given the large size of the schemes involved. Very few available MLST callers are currently capable of dealing with large MLST schemes. We introduce MentaLiST, a new MLST caller, based on a k-mer voting algorithm and written in the Julia language, specifically designed and implemented to handle large typing schemes. We test it on real and simulated data to show that MentaLiST is faster than any other available MLST caller while providing the same or better accuracy, and is capable of dealing with MLST schemes with up to thousands of genes while requiring limited computational resources. MentaLiST source code and easy installation instructions using a Conda package are available at https://github.com/WGS-TB/MentaLiST.
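To make the idea of k-mer voting concrete, the following Python sketch shows one simple way such a caller could work: index every k-mer of every allele in the scheme, let each k-mer found in the reads cast one vote for every allele that contains it, and report the top-voted allele per locus. This is a simplified illustration under stated assumptions, not MentaLiST's implementation (which is written in Julia and whose exact voting rule is not described in this abstract). The function names `kmers`, `build_index` and `call_alleles`, the unweighted one-vote-per-k-mer rule, and the toy scheme are all hypothetical.

```python
from collections import Counter, defaultdict


def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(scheme, k):
    """Preprocessing: map each k-mer to the (locus, allele) pairs containing it.

    `scheme` is assumed to be {locus_name: {allele_id: sequence}}.
    """
    index = defaultdict(set)
    for locus, alleles in scheme.items():
        for allele_id, seq in alleles.items():
            for kmer in kmers(seq, k):
                index[kmer].add((locus, allele_id))
    return index


def call_alleles(index, reads, k):
    """Calling: each read k-mer votes once for every allele that contains it."""
    votes = defaultdict(Counter)
    for read in reads:
        for kmer in kmers(read, k):
            # k-mers absent from the scheme cast no votes.
            for locus, allele_id in index.get(kmer, ()):
                votes[locus][allele_id] += 1
    # Report the most-voted allele per locus (ties are not resolved specially).
    return {locus: tally.most_common(1)[0][0] for locus, tally in votes.items()}


# Toy scheme: two loci with two alleles each, differing by a single base.
scheme = {
    "geneA": {1: "ACGTACGTGA", 2: "ACGTTCGTGA"},
    "geneB": {1: "TTGACCATGC", 2: "TTGACGATGC"},
}
# Toy reads drawn from geneA allele 2 and geneB allele 1.
reads = ["ACGTTCGT", "TCGTGA", "TTGACCAT", "CCATGC"]

index = build_index(scheme, k=4)
calls = call_alleles(index, reads, k=4)
print(calls)  # {'geneA': 2, 'geneB': 1}
```

In this sketch the index is built once per scheme and reused for every sample, so the per-sample cost is one lookup per read k-mer. A real caller would also need to handle reverse complements, sequencing errors and novel alleles, none of which this example attempts.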
Keywords: multi-locus sequence typing; next-generation sequencing; pathogen surveillance
Year: 2018 PMID: 29319471 PMCID: PMC5857373 DOI: 10.1099/mgen.0.000146
Source DB: PubMed Journal: Microb Genom ISSN: 2057-5858
Fig. 1. Sketch of a coloured de Bruijn graph with four alleles, each represented by a different colour. The branching nodes are marked in grey, and paths between those nodes correspond to contigs. All nodes of the same contig have the same set of colours.
Fig. 2. Pseudocode for the preprocessing and calling algorithms in MentaLiST.
Fig. 3. Running time for all MLST caller programs on the different schemes. X indicates that there are no results for the caller on the dataset, either because it failed or took more than 24 h. The bars represent the 95 % confidence interval.
Fig. 4. Average number of calling errors from three M. tuberculosis simulated samples, with varying depth of coverage and using the 553 gene ecgMLST scheme. The bars represent the 95 % confidence interval.
Fig. 5. Average number of calling errors of three M. tuberculosis simulated datasets as a function of the proportion of the minor strain, using the 553 gene ecgMLST scheme. The bars represent the 95 % confidence interval.
Fig. 7. Peak memory usage for all MLST callers on the different schemes. X indicates that there are no results for the caller on the dataset, either because it failed or took more than 24 h. The bars represent the 95 % confidence interval.