Phylotyper: in silico predictor of gene subtypes.

Matthew D Whiteside¹, Victor P J Gannon¹, Chad R Laing¹.

Abstract

SUMMARY: Whole genome sequencing (WGS) is being adopted in public health for improved surveillance and outbreak analysis. In public health, subtyping has been used to infer phenotypes and distinguish bacterial strain groups. In silico tools that predict subtypes from sequence data are needed to transition historical data to WGS-based protocols. Phylotyper is a novel solution for in silico subtype prediction from gene sequences. Designed for incorporation into WGS pipelines, it is a general prediction tool that can be applied to different subtype schemes. Phylotyper uses phylogeny to model the evolution of the subtype and infer subtypes for unannotated sequences. The phylogenetic framework in Phylotyper improves accuracy over approaches based solely on sequence similarity and provides useful contextual feedback.
AVAILABILITY AND IMPLEMENTATION: Phylotyper is a Python and R package. It is available from: https://github.com/superphy/insilico-subtyping.
CONTACT: matthew.whiteside@phac-aspc.gc.ca.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2017; PMID: 29036291; PMCID: PMC5870578; DOI: 10.1093/bioinformatics/btx459

Source DB: PubMed; Journal: Bioinformatics; ISSN: 1367-4803; Impact factor: 6.937

1 Introduction

Whole-genome sequencing (WGS) is transforming the public health field by providing an efficient method for surveying bacterial populations. The speed, discriminatory power and broad utility of WGS can improve surveillance and outbreak analysis. Adoption of WGS in public health, however, requires transitioning historical data to the new methods (Jenkins, 2015). One of the workhorse methods in public health is subtyping. Subtyping methods can broadly be categorized as phenotype-based or DNA-based. Phenotype-based subtypes are, for example, interrogated by biochemical tests (biotyping), detection of surface antigens (serotyping) or susceptibility to bacteriophage (phage typing) (Wiedmann, 2002). Alternatively, DNA-based subtyping examines and classifies bacteria based on genetic content. DNA-based subtyping uses a variety of methods, from PCR and pulsed-field gel electrophoresis to DNA sequencing, to assign subtype designations to bacterial isolates (Wiedmann, 2002). As a surveillance tool, subtypes provide a clear-cut designation that is typically used to distinguish taxonomic groups and infer phenotypes.

A WGS-based approach to subtyping would have several benefits over current subtype systems: it would be faster, have improved discrimination and be cheaper and easier to maintain (Jenkins, 2015). Accordingly, new in silico tools have been developed to predict gene subtypes from WGS data (Carrillo et al., 2016; Ingle et al., 2016; Joensen et al., 2015). These tools predict subtypes (from either phenotype-based or DNA-based systems) by targeting sequence variation in a specific region or gene in the genome. An example of a subtyping system that has been adapted for WGS is serotyping of Escherichia coli. The sequence of the O-antigen processing genes in E. coli is used to predict the O-antigen group in serotyping (Ingle et al., 2016; Joensen et al., 2015). Another example is the Shiga toxin (Stx) subtype, a DNA-based subtyping scheme generated using PCR. The tool described by Carrillo et al. (2016) predicts Stx subtype by simulating PCR in silico.

Phylotyper is a novel in silico predictor of subtypes from sequence data. Like the tools of Joensen et al. (2015), Ingle et al. (2016) and Carrillo et al. (2016), it works on subtypes that can be predicted from the sequences of specific, pre-selected genes or genomic regions. Phylotyper is unique in that it builds a phylogenetic tree consisting of reference sequences with known subtypes and the unknown query sequences to help inform subtype prediction. Using phylogenetic ancestral state reconstruction to assign the likelihood of each subtype to the branch points of the tree, Phylotyper then assigns an unknown query sequence a subtype based on the value extrapolated from its ancestors in the tree.

2 Implementation

The core of Phylotyper is an ancestral state reconstruction (ASR) method that has been adapted for hidden state prediction (Revell, 2011). In phylogenetic analysis, ancestral state reconstruction involves the prediction of traits of ancestors from extant descendants. This methodology can be extended to also predict properties in a limited number of existing strains. In Phylotyper, the rerootingMethod function from the phytools R package is used to perform the ASR. This function calculates Bayesian posterior probabilities for unknown tip nodes in a phylogenetic tree. The state with the maximum posterior probability is the most likely state for the node given the empirically estimated subtype evolution model and the phylogeny. In the context of Phylotyper, the posterior probability provides a confidence value associated with a predicted subtype.

Phylotyper is developed in Python and R. The steps in the Phylotyper pipeline are:
(i) Identify subtype gene loci in input genomes using BLAST (Camacho et al., 2009). Inputs are in fasta format. Hits that align over less than 95% of a reference sequence, or that have under 90% sequence identity with it, are discarded. Users are notified if no loci are found in a genome.
(ii) Align input genes against a pre-aligned set of reference genes using MAFFT's --add feature (Katoh and Standley, 2013).
(iii) If multiple loci are involved, concatenate individual alignments into a superalignment.
(iv) Generate a maximum likelihood phylogenetic tree of the aligned genes with FastTree (Price et al., 2010).
(v) Run phytools rerootingMethod using the phylogenetic tree and assigned subtypes (Revell, 2011).
(vi) Identify the subtype with maximum probability for the unknown genes and report it to the user in a text output file. Users are also provided with an image of the phylogenetic tree overlaid with the likelihood values (e.g. Fig. 1).
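The hit filter in step (i) can be illustrated with a minimal sketch. It is not taken from the Phylotyper source: the `BlastHit` record, the function names and the example values are hypothetical, and it assumes that the 95% threshold refers to alignment coverage of the reference sequence and that both thresholds are inclusive.

```python
from dataclasses import dataclass


@dataclass
class BlastHit:
    """One BLAST hit of a reference locus in an input genome (hypothetical record)."""
    genome_id: str
    ref_id: str
    percent_identity: float  # 0-100, as reported by BLAST
    alignment_length: int    # aligned positions on the reference
    ref_length: int          # full length of the reference sequence


def passes_thresholds(hit, min_coverage=95.0, min_identity=90.0):
    """Keep a hit only if it meets both the coverage and the identity cutoff."""
    # Assumption: coverage is measured relative to the reference length.
    coverage = 100.0 * hit.alignment_length / hit.ref_length
    return coverage >= min_coverage and hit.percent_identity >= min_identity


def filter_hits(hits, min_coverage=95.0, min_identity=90.0):
    """Discard hits that fall below either threshold."""
    return [h for h in hits if passes_thresholds(h, min_coverage, min_identity)]


# Made-up example hits against a single reference locus.
hits = [
    BlastHit("genomeA", "stx2_ref", 98.5, 1230, 1241),  # passes both cutoffs
    BlastHit("genomeB", "stx2_ref", 99.0, 600, 1241),   # coverage too low
    BlastHit("genomeC", "stx2_ref", 85.0, 1241, 1241),  # identity too low
]

kept = filter_hits(hits)

# Genomes with no surviving hit would trigger the "no loci found" notification.
genomes_without_loci = sorted(
    {h.genome_id for h in hits} - {h.genome_id for h in kept}
)

print([h.genome_id for h in kept])
print(genomes_without_loci)
```

Only hits that survive this filter would be passed on to the alignment step; genomes left with no hits correspond to the case in which users are notified that no loci were found.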

Fig. 1

Phylogenetic tree for select Stx2 genes. The subtype marginal likelihood is displayed at each node as a pie chart. Subtype is indicated by color as shown in the legend. The full Stx2 tree is displayed in Supplementary Figure S1.

Detailed descriptions of how to run Phylotyper are provided here: https://github.com/superphy/insilico-subtyping. Phylotyper was designed to be incorporated into a WGS workflow. The main input into Phylotyper is assembled genome sequences (in fasta format). Putative loci needed for the subtype scheme are identified in the input genomes using BLAST (Camacho et al., 2009). The identified loci are then sent to the Phylotyper subtype prediction module. It is possible in Phylotyper to use multiple loci for subtype prediction. Individual locus alignments are concatenated to form a single superalignment that is used to build the phylogenetic tree.

Currently, the Phylotyper package includes the following subtype schemes for Escherichia coli: Stx, intimin and serotype O- and H-types (Supplementary Table S1). However, the Phylotyper software also has the capability to add new subtype schemes (instructions are provided here: https://github.com/superphy/insilico-subtyping). Creating a new subtype scheme will save the required reference files, allowing newly added schemes to be easily re-run from Phylotyper. To add a new subtype scheme for use in Phylotyper, users require a training set of sequences with assigned subtypes whose phylogenetic grouping is predictive of the subtype. Phylotyper assumes that (i) the provided training sequences are homologous, specifically, that they are suitable for alignment and phylogenetic reconstruction, (ii) the sequence phylogeny is correlated with the subtype distribution and (iii) the set is representative of the range of sequences that make up all subtypes in the scheme. Checks are built into the pipeline for adding new schemes to validate the submitted reference set. Each new subtype scheme is subject to two tests, and the results are reported to the user.
The first test checks that the distribution of patristic distances between sequences of the same subtype is both smaller than, and distinct from, the distribution of distances between sequences of different subtypes. This test can identify isolated cases of potentially mislabelled subtype genes that are tightly clustered with other subtypes. The second test computes an accuracy measure, the F-score, through a leave-one-out cross-validation, a procedure that uses each sequence in the training set in turn as a test input to estimate positive and negative prediction rates (see Supplementary Methods for details). The performance metrics, recall and F-score, decrease rapidly as the correlation between the training set phylogeny and subtype distribution decreases (Supplementary Fig. S2). These checks verify that the phylogenetic grouping provided by the training set is predictive of the subtype. There is no check that can confirm that the training set covers all subtypes in a particular scheme; Phylotyper can only predict subtypes that are represented in the training set. It is important that users monitor schemes and update them as gaps are identified. Phylotyper is designed to return a non-significant/undetermined result when it encounters an unknown subtype that has no representative in the training set.
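The leave-one-out estimate of the F-score can be sketched as follows. This is an illustrative outline rather than the procedure in the Supplementary Methods: the `toy_predict` function is a hypothetical stand-in for the phylogenetic predictor (it uses a made-up one-dimensional "position" in place of tree placement), and the metric definitions are an assumption, with precision taken as correct calls over calls made, and recall as correct calls over all test sequences, so that undetermined results lower recall only.

```python
def leave_one_out(training, predict):
    """Hold out each training record in turn and predict it from the rest."""
    results = []
    for i, held_out in enumerate(training):
        reduced = training[:i] + training[i + 1:]
        results.append((held_out["subtype"], predict(reduced, held_out)))
    return results


def precision_recall_fscore(results):
    """Summarise (truth, prediction) pairs; a prediction of None means undetermined."""
    calls = [(truth, pred) for truth, pred in results if pred is not None]
    correct = sum(1 for truth, pred in calls if truth == pred)
    precision = correct / len(calls) if calls else 0.0
    recall = correct / len(results) if results else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def toy_predict(reduced, query, max_distance=1.0):
    """Hypothetical predictor: copy the subtype of the nearest remaining record,
    or return None (undetermined) if nothing is close enough."""
    nearest = min(reduced, key=lambda r: abs(r["position"] - query["position"]))
    if abs(nearest["position"] - query["position"]) > max_distance:
        return None
    return nearest["subtype"]


# Made-up training set; "position" is a 1-D stand-in for phylogenetic placement.
training = [
    {"id": "a", "position": 0.0, "subtype": "Stx2a"},
    {"id": "b", "position": 0.2, "subtype": "Stx2a"},
    {"id": "c", "position": 0.4, "subtype": "Stx2a"},
    {"id": "d", "position": 5.0, "subtype": "Stx2c"},
    {"id": "e", "position": 5.3, "subtype": "Stx2c"},
    {"id": "f", "position": 9.0, "subtype": "Stx2d"},  # only member of its subtype
]

results = leave_one_out(training, toy_predict)
precision, recall, fscore = precision_recall_fscore(results)
print(f"precision={precision:.2f} recall={recall:.2f} F-score={fscore:.2f}")
```

In this toy set the single Stx2d sequence has no close relative once it is held out, so it is reported as undetermined; that reduces recall without reducing precision.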

3 Results

Phylotyper is a progression from the sequence-similarity approach that is the basis of current in silico subtype prediction strategies. To compare Phylotyper to a sequence-similarity-based approach, we ran two validations that looked at how both methods perform when confronted with (i) a gene sequence or (ii) a subtype class not present in the training set.

The first validation was a leave-one-out cross-validation test that iterated through each gene in the training set, retraining the prediction tools on a reduced training set that excluded the selected test gene, and then checking whether the retrained predictor could recover the subtype of the test gene. This validation tests how the predictors perform when run on a distinct sequence that is not in the training set.

The second validation examined how the predictors perform when tested with a gene that has a subtype not in the reference set. In this validation, each subtype was iterated over and all genes that were assigned the subtype were removed from the training set. In each iteration, we recorded the number of false-positive subtype assignments when the removed sequences were used as input. The correct response for the predictors was to return a negative result, since the subtype does not exist in the training set.

For these assessments, we developed a sequence-similarity-based tool that assigns putative subtypes using BLAST. This generalized BLAST tool, based on the approach used in Joensen et al. (2015), assigns a query sequence a subtype when the top BLAST match from an annotated reference database is above a pre-selected percent identity and alignment coverage cutoff. Details of how the assessment was conducted are available in Supplementary Methods. The assessment examined the five subtype schemes available in Phylotyper: Stx1, Stx2, Eae, H-type (FliC) and O-type (Wzy and Wzx).
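The top-BLAST-hit comparison tool and the withheld-subtype validation can be illustrated with a short sketch. It is a simplified assumption of how such a tool might work, not the code used in the assessment: the hit records, the use of bit score to rank matches, and the cutoff values are all hypothetical.

```python
def top_hit_subtype(hits, ref_subtypes, min_identity, min_coverage):
    """Transfer the subtype of the best-scoring reference hit if it clears both cutoffs.

    hits: list of dicts with keys ref_id, bitscore, identity, coverage.
    ref_subtypes: dict mapping reference ids to annotated subtypes.
    Returns a subtype, or None when no call is made.
    """
    candidates = [h for h in hits if h["ref_id"] in ref_subtypes]
    if not candidates:
        return None
    best = max(candidates, key=lambda h: h["bitscore"])
    if best["identity"] >= min_identity and best["coverage"] >= min_coverage:
        return ref_subtypes[best["ref_id"]]
    return None


# Made-up reference annotations and hits for one query whose true subtype is Stx1d.
ref_subtypes = {"ref1": "Stx1a", "ref2": "Stx1c", "ref3": "Stx1d"}
query_hits = [
    {"ref_id": "ref3", "bitscore": 900, "identity": 99.5, "coverage": 100.0},
    {"ref_id": "ref1", "bitscore": 850, "identity": 96.0, "coverage": 100.0},
    {"ref_id": "ref2", "bitscore": 800, "identity": 94.0, "coverage": 100.0},
]

# Full reference set: the query is matched to its own subtype.
full_call = top_hit_subtype(query_hits, ref_subtypes, 95.0, 95.0)

# Withhold every Stx1d reference, as in the second validation.
withheld = {r: s for r, s in ref_subtypes.items() if s != "Stx1d"}
loose_call = top_hit_subtype(query_hits, withheld, 95.0, 95.0)   # false positive
strict_call = top_hit_subtype(query_hits, withheld, 97.0, 95.0)  # no call

print(full_call, loose_call, strict_call)
```

With the Stx1d references withheld, the correct response is no call. In this made-up example the looser cutoff transfers the subtype of the next-best match, which is the kind of false positive counted in the second validation.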
When tasked with assigning a novel gene sequence not in the training set in the leave-one-out validation, Phylotyper consistently had higher precision than the top-BLAST-hit approach. The average precision in Phylotyper was 0.99 versus 0.96 in the top-BLAST-hit approach (Supplementary Table S2). The BLAST approach also had lower recall; it had an average recall of 0.81 compared to 0.90 with Phylotyper. Similarly, when entire subtype classes were withheld from the training set, Phylotyper had consistently lower false positive rates for all subtype schemes tested; the average false positive rate in this test case was 0.11 for Phylotyper, while for the BLAST approach it was 0.30.

A separate assessment of the V-typer tool, a Stx subtype predictor, was run using selected Stx gene sequences from the experimentally verified Phylotyper training set (Carrillo et al., 2016). The test Stx genes had sufficient surrounding DNA sequence to support in silico PCR. In total, 24 Stx gene sequences were tested with the V-typer tool; V-typer returned results for 7, all correct. Phylotyper correctly predicted the subtype for all 24 genes. Based on this level of recall, it appears that conditions in the Stx subtype environment are challenging for simulated PCR.

All new or updated subtype schemes added to Phylotyper are subject to a leave-one-out cross-validation test. The test is part of the add pipeline and is used to estimate the F-score of the subtype scheme. The F-score reflects the predictive capability of the subtype scheme. If the phylogeny of the training set gene sequences is not correlated with the subtype distribution, this will be reflected in the F-score. To demonstrate this property, we randomly assigned subtypes to increasing proportions of the genes in the training set and computed the F-score with the leave-one-out validation at each proportion level. The F-score and recall decrease rapidly as the proportion of randomly altered subtypes increases (Supplementary Fig. S2).
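The random relabelling step of this experiment might look like the sketch below. The details are assumptions: subtypes are redrawn uniformly from the set of subtypes observed in the training set (so a redrawn label can coincide with the original), and the labels and proportion levels are made up. Each altered label set would then be passed to the leave-one-out validation to obtain an F-score for that proportion level.

```python
import random


def randomize_subtypes(labels, proportion, rng):
    """Return a copy of labels in which a given proportion of sequences
    have had their subtype redrawn at random from the observed subtypes."""
    ids = sorted(labels)
    n_alter = round(proportion * len(ids))
    chosen = rng.sample(ids, n_alter)
    subtypes = sorted(set(labels.values()))
    altered = dict(labels)
    for seq_id in chosen:
        # Assumption: uniform draw, which may return the original subtype.
        altered[seq_id] = rng.choice(subtypes)
    return altered


# Made-up training labels: sequence id -> subtype.
labels = {
    "seq1": "Stx2a", "seq2": "Stx2a", "seq3": "Stx2a", "seq4": "Stx2c",
    "seq5": "Stx2c", "seq6": "Stx2d", "seq7": "Stx2d", "seq8": "Stx2d",
}

rng = random.Random(42)  # fixed seed so the relabelling is reproducible
for proportion in (0.0, 0.25, 0.5, 1.0):
    altered = randomize_subtypes(labels, proportion, rng)
    changed = sum(altered[k] != labels[k] for k in labels)
    print(f"proportion={proportion:.2f} labels changed={changed}")
```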

4 Discussion

From assembled WGS data, Phylotyper can assign subtypes to unclassified genes. Currently, the Phylotyper software offers subtyping schemes for E. coli. It can, however, be applied to other subtype schemes, and Phylotyper includes functionality to build new schemes. Phylotyper can predict subtypes for any input sequence that is strongly correlated with a subtype scheme distribution. However, input sequences with a direct biological role in generating the subtype phenotype will have fewer caveats: a gene sequence that is causal cannot become disassociated from the subtype through recombination or horizontal gene transfer. Outside of E. coli, the PCR-based capsular typing system for Haemophilus influenzae, neurotoxin serotyping in Clostridium botulinum and the haemagglutinin and neuraminidase types in Influenza A virus are all examples of potential future subtype schemes that we are incorporating into Phylotyper. We plan on expanding the Phylotyper resource by adding and updating high-quality subtype schemes for other pathogens. We encourage users to contact us with their new subtype schemes or updates (https://github.com/superphy/insilico-subtyping).

The main strategy currently in use by other in silico tools for predicting subtypes is to use sequence similarity to annotated gene alleles. Query genomes or genes are matched to gene alleles that determine the subtype, or that are correlated with the subtype. For example, SerotypeFinder uses BLAST to find the top matches based on sequence similarity to the O-antigen processing genes for in silico O-typing, and to the flagellin genes for H-typing, of E. coli (Jenkins, 2015). O-type and H-type are transferred from the top matches to the queries provided they are above coverage and percent identity thresholds. This general strategy of allele matching is also applied in the EcOH tool (Ingle et al., 2016); however, the EcOH tool can directly use unassembled sequence reads as input. The EcOH tool aligns reads to alleles linked to E. coli O-types and H-types, and identifies the top candidates that have an alignment score above pre-defined thresholds.

Phylotyper is comparatively more robust, as it generates fewer Type I errors when encountering novel alleles or subtypes not present in the training set. With the allele matching strategy, the make-up of the reference set can have a greater impact on performance. When alleles or even subtypes are missing from the reference database, the sequence similarity approach more frequently generates false positive predictions. In contrast, Phylotyper computes an empirical model of subtype evolution to predict subtypes for unclassified sequences. By estimating the phylogenetic distribution of each subtype, Phylotyper is less likely to make a Type I error when encountering a novel subtype or allele. The empirical testing we performed demonstrated this behavior; the rate of false positive classifications was significantly lower than in a sequence-similarity approach in validations where we withheld an allele or an entire subtype from the training set and used it as a test input.

V-typer takes a distinct approach; it directly simulates the in vitro wet-lab PCR procedure used to perform Stx subtyping (Carrillo et al., 2016). V-typer's direct replication of a wet-lab method in silico means it can only be applied to subtype schemes that use PCR. Additionally, we found in our evaluation of Stx subtypes that it failed to generate predictions for most test cases.

From a methodology standpoint, Phylotyper has an additional benefit over current methods: the phylogenetic framework in Phylotyper provides a statistical likelihood for interpreting results. In comparison, there is no built-in mechanism in the sequence similarity approach to quantify the level of confidence in assigning a subtype to an allele.

Subtypes are mainly used as a proxy for evolutionarily related bacterial gene groups or to infer phenotypes. A recent analysis of O-antigen serotypes and their associated O-antigen gene sequences in E. coli found that the sequence data indicated several changes to the organization of the O-groups (DebRoy et al., 2016). There are potentially other subtype schemes that show discrepancies between genetic data and subtype grouping. A tool that can evaluate the ability of a genotype to predict a subtype would be better equipped for developing new subtype schemes or updating current schemes for WGS workflows. The add pipeline in Phylotyper tests subtype schemes for their predictive accuracy by returning an F-score based on a cross-validation assessment. The validation verifies that the phylogeny generated from the training set sequences can be used to predict the subtypes with a high level of accuracy. We showed in empirical tests that the farther a subtype is dissociated from the input gene's phylogeny, the lower the F-score computed in the Phylotyper add pipeline. In addition to this subtype-level verification, Phylotyper also computes confidence scores for each individual prediction that reflect the rate of subtype change occurring at the input gene's phylogenetic locale. If an input sequence falls into a region of the phylogenetic tree where the subtype is highly fluid due to rapid evolutionary change or poor subtype-genotype correlation, users are made aware of this by a low confidence score. The ability of Phylotyper to inform users about the level of agreement between the subtype assignments and genotype makes it uniquely capable of transitioning historical subtype data to new whole genome sequence-based systems.
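How a per-prediction confidence score can translate into a call or an undetermined result is sketched below. The function name, the example probabilities and the 0.9 cutoff are hypothetical; the source does not state the threshold Phylotyper uses, only that the subtype with the maximum probability is reported and that low confidence is flagged to the user.

```python
def call_subtype(posteriors, min_probability=0.9):
    """Pick the subtype with the highest posterior probability for one query.

    posteriors: dict mapping subtype -> posterior probability at the query tip.
    Returns (call, probability); the call is "undetermined" when the best
    probability is below the (hypothetical) cutoff.
    """
    subtype, probability = max(posteriors.items(), key=lambda kv: kv[1])
    if probability < min_probability:
        return "undetermined", probability
    return subtype, probability


# Made-up posterior probabilities for two query sequences.
stable_region = {"Stx2a": 0.97, "Stx2c": 0.02, "Stx2d": 0.01}
fluid_region = {"Stx2a": 0.45, "Stx2c": 0.40, "Stx2d": 0.15}

confident_call = call_subtype(stable_region)
uncertain_call = call_subtype(fluid_region)

print(confident_call)
print(uncertain_call)
```

A query placed in a region of the tree where neighbouring subtypes are mixed yields a flat probability distribution, so no single subtype clears the cutoff and the result is reported as undetermined rather than forced into the nearest class.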

Funding

This work is funded in part by a grant from the Public Health Agency of Canada's Genomics Research and Development Initiative (Phase VI, Antimicrobial Resistance Shared Priority Project).

Conflict of Interest: none declared.

References

1. Martin Wiedmann. Subtyping of bacterial foodborne pathogens. Nutr Rev, 2002-07 (review).
2. Morgan N Price, Paramvir S Dehal, Adam P Arkin. FastTree 2 – approximately maximum-likelihood trees for large alignments. PLoS One, 2010-03-10.
3. Catherine D Carrillo, Adam G Koziol, Amit Mathews, Noriko Goji, Dominic Lambert, George Huszczynski, Martine Gauthier, Kingsley Amoako, Burton W Blais. Comparative Evaluation of Genomic and Laboratory Approaches for Determination of Shiga Toxin Subtypes in Escherichia coli. J Food Prot, 2016-12.
4. Katrine G Joensen, Anna M M Tetzschner, Atsushi Iguchi, Frank M Aarestrup, Flemming Scheutz. Rapid and Easy In Silico Serotyping of Escherichia coli Isolates by Use of Whole-Genome Sequencing Data. J Clin Microbiol, 2015-05-13.
5. Claire Jenkins. Whole-Genome Sequencing Data for Serotyping Escherichia coli-It's Time for a Change! J Clin Microbiol, 2015-06-17.
6. Kazutaka Katoh, Daron M Standley. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol, 2013-01-16.
7. Christiam Camacho, George Coulouris, Vahram Avagyan, Ning Ma, Jason Papadopoulos, Kevin Bealer, Thomas L Madden. BLAST+: architecture and applications. BMC Bioinformatics, 2009-12-15.
8. Danielle J Ingle, Mary Valcanis, Alex Kuzevski, Marija Tauschek, Michael Inouye, Tim Stinear, Myron M Levine, Roy M Robins-Browne, Kathryn E Holt. In silico serotyping of E. coli from short read data identifies limited novel O-loci but extensive diversity of O:H serotype combinations within and between pathogenic lineages. Microb Genom, 2016-07-11.
9. Chitrita DebRoy, Pina M Fratamico, Xianghe Yan, GianMarco Baranzoni, Yanhong Liu, David S Needleman, Robert Tebbs, Catherine D O'Connell, Adam Allred, Michelle Swimley, Michael Mwangi, Vivek Kapur, Juan A Raygoza Garay, Elisabeth L Roberts, Robab Katani. Comparison of O-Antigen Gene Clusters of All O-Serogroups of Escherichia coli and Proposal for Adopting a New Nomenclature for O-Typing. PLoS One, 2016-01-29.
