PHACTS, a computational approach to classifying the lifestyle of phages.

Katelyn McNair, Barbara A Bailey, Robert A Edwards.

Abstract

MOTIVATION: Bacteriophages have two distinct lifestyles: virulent and temperate. The virulent lifestyle has many implications for phage therapy, genomics and microbiology. The lifestyle of a newly sequenced phage is currently determined using standard culturing techniques. Such laboratory work is not only costly and time consuming, but also cannot be used on phage genomes constructed from environmental sequencing. Therefore, a computational method that utilizes the sequence data of phage genomes is needed.
RESULTS: Phage Classification Tool Set (PHACTS) utilizes a novel similarity algorithm and a supervised Random Forest classifier to predict whether the lifestyle of a phage, described by its proteome, is virulent or temperate. The similarity algorithm creates a training set from phages with known lifestyles that, along with the lifestyle annotation, is used to train a Random Forest to classify the lifestyle of a phage. PHACTS predictions are shown to have a 99% precision rate.
AVAILABILITY AND IMPLEMENTATION: PHACTS was implemented in the PERL programming language and utilizes the FASTA program (Pearson and Lipman, 1988) and the R programming language library 'Random Forest' (Liaw and Weiner, 2010). The PHACTS software is open source and is available as a downloadable stand-alone version or can be accessed online through a user-friendly web interface. The source code, help files and online version are available at http://www.phantome.org/PHACTS/.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2012 PMID: 22238260 PMCID: PMC3289917 DOI: 10.1093/bioinformatics/bts014

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

Viruses that infect bacteria are called bacteriophages or phages. It is estimated that there are 10^30 bacterial cells in the biosphere (Whitman et al., 1998). Given that typical ratios of bacteria to phage are on the order of 1:10 (Wommack and Colwell, 2000), it is estimated that there exist 10^31 phage particles on the planet. Viruses are thus the most abundant biological entities on the planet. Phages are ubiquitous and can be found in any environment where their bacterial hosts are present. Phages are found in high numbers in terrestrial environments such as soil, and in aquatic environments such as lakes and seawater (Srinivasiah et al.). Recent estimates suggest that there exist globally ~100 million phage species (Rohwer, 2003); however, only a small fraction of phages have so far been characterized. When a phage infects a bacterial cell, the phage enters into one of two distinct lifestyles: virulent or temperate. During a virulent lifestyle a phage infects a bacterium; its genome is replicated many times; and the newly created copies are released into the surrounding environment through lysis, extrusion or budding. In contrast, during a temperate lifestyle a phage infects a bacterium and either integrates its DNA into the bacterial genome or re-circularizes its DNA into a stable plasmid. The temperate phage will live in this semi-stable lifestyle as a prophage as the host bacterium continues to grow and divide. The prophage will be carried through future bacterial cell divisions until appropriate environmental conditions cause the temperate phage to enter into a virulent lifestyle and release itself from the host bacterium. This switch into a virulent lifestyle is referred to as induction and is generally caused by host cell damage (Witkin, 1976) or environmental stressors (Clarke, 1998; Clark et al.).
Not only does the characterization of phage lifestyles contribute to the understanding of phage population dynamics, genomics and microbiology, but the virulent lifestyle also has applications toward phage therapy and biocontrol (Housby and Mann, 2009). Previously, the lifestyle of a phage was identified through culturing and isolation in the lab. This is not only time consuming but also costly. With the advent of shotgun sequencing, large numbers of phages are being sequenced at an increasing rate. As new phages can now be sequenced faster than culturing can identify their lifestyles, there is a need to computationally annotate genomic data and also to make predictions about the lifestyle. In addition, because many of these newly sequenced genomes are derived from entire environmental community sequencing methods, it may not be possible to isolate the phages for culturing. Computationally classifying phages based on their genomes is difficult due to the highly mosaic organization of their genomes (Hendrix et al., 1999). Unlike bacteria, which have 16S rRNA and various other conserved genes that can be used for taxonomy and phylogeny, phages have no universally present gene that can be used for analysis (Rohwer and Edwards, 2002). The first attempt at using genomic data to classify phages, by comparing structural proteins, does not work well across all clades of phages (Proux et al., 2002). An alternative methodology was created by Rohwer and Edwards, and was used to create the Phage Proteomic Tree (Rohwer and Edwards, 2002). To deal with the mosaicism of phages, Lima-Mendez et al. implemented a framework for a reticulate classification based on gene content, by building a weighted graph where nodes represent phages and edges represent shared gene/protein similarities. Recently, this reticulate classification was extended to shared evolutionarily conserved modules consisting of groups of proteins that have a similar phylogenetic profile (Lima-Mendez et al., 2011).
Certain modules were found to be associated with either temperate or virulent phages, and it was suggested that a refinement of the methodology might be used for an automated classification of phage lifestyle. An alternative method uses the tetranucleotide frequency differences between a phage and host to classify the lifestyle of the phage (Deschavanne et al.); however, this method is severely limited by the need to have the phage's host fully sequenced. In this work, a Phage Classification Tool Set (PHACTS) was developed to classify whether a phage's preferred lifestyle is virulent or temperate. PHACTS utilizes a novel similarity algorithm and a supervised Random Forest classifier to predict whether the lifestyle of a phage is virulent or temperate. The similarity algorithm creates a training set from phages with known lifestyles that, along with the lifestyle annotation, is used to train a Random Forest to classify the lifestyle of a phage. To test the accuracy of PHACTS, each phage with an annotated lifestyle was removed from the database one at a time and treated as a single phage with an unknown lifestyle. The lifestyle of the phage was predicted using PHACTS and the predicted lifestyle was then compared with the actual lifestyle.

2 METHODS

2.1 Implementation

2.1.1 Lifestyle database

At the time of this work, the PHANTOME database of phages with complete genomes contained 654 phages (www.phantome.org). The lifestyles for 227 of these phages were manually curated from various literature sources. In this subset of 227 phages with a known lifestyle, there were 148 temperate phages and 79 virulent phages, and thus temperate phages outnumbered virulent phages in the database by roughly 2:1. These phages with a known lifestyle were used to create a local database for use during PHACTS classifications.

2.1.2 Query proteins

A set of query protein sequences Q = {P_1, P_2, …, P_M} is created by randomly selecting M proteins, where M is the user-specified number of proteins to use for creating the training set. From each class, M/C proteins belonging to phages of that class are selected at random, where C is the number of classes in the training set. For our experiments, it was empirically found that M=600 gave the best results. When M was decreased the accuracy went down, and when M was increased the runtimes went up without a corresponding increase in accuracy.
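The class-balanced selection of query proteins described above can be sketched as follows. This is a minimal illustration, not the PHACTS source code: the function name `select_query_proteins`, the toy protein pools and the choice to sample without replacement are all assumptions made for the example.

```python
import random

def select_query_proteins(proteins_by_class, m, seed=None):
    """Pick m query proteins, m // C from each of the C lifestyle classes."""
    rng = random.Random(seed)
    per_class = m // len(proteins_by_class)
    query = []
    for lifestyle in sorted(proteins_by_class):
        pool = proteins_by_class[lifestyle]
        if len(pool) < per_class:
            raise ValueError(f"class {lifestyle!r} has fewer than {per_class} proteins")
        # Assumption: proteins are drawn without replacement within a class.
        query.extend(rng.sample(pool, per_class))
    return query

# Hypothetical protein identifiers standing in for the real database.
pools = {
    "temperate": [f"T{i}" for i in range(500)],
    "virulent": [f"V{i}" for i in range(400)],
}
Q = select_query_proteins(pools, m=600, seed=1)
```

With M=600 and C=2, this draws 300 proteins from each lifestyle class.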

2.1.3 Training sets

To create the training set for the Random Forest classifier, a set of N similarity vectors is assembled, where N is equal to the number of phages to use as training cases. From each class, N/C phage genomes are selected at random, without replacement. From these N phages the list L = {G_1, G_2, …, G_N} is created. The class with the fewest representative samples limits how many training cases can be used. For our purposes, it was empirically found that N=100 gave the best results. Having 50 phages per class was adequate to provide accurate results as well as allowing for a diverse random sampling. For each of these N genomes, a similarity vector X is assembled. The proteins of a phage are aligned against every protein in Q using the FASTA program. For each query protein P_j, the score S_j is the percent identity of the highest scoring pair between P_j and any protein in that phage's proteome. Each score S_j is inserted into the similarity vector, giving X = (S_1, S_2, …, S_M). The manually curated lifestyles of the phages are retrieved from the locally stored database and are used as the classification factors.
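The construction of one similarity vector can be sketched as below. The sketch assumes the FASTA alignments have already been run and are available as a lookup table of (alignment score, percent identity) pairs. The table layout, the function name `similarity_vector` and the convention of recording 0 when a query protein has no hit are assumptions for illustration, not details taken from PHACTS.

```python
def similarity_vector(proteome, query_proteins, hits):
    """Build one similarity vector with one entry per query protein.

    hits maps (phage_protein, query_protein) -> (alignment_score, percent_identity).
    For each query protein, the entry is the percent identity of the
    highest scoring pair; 0.0 is recorded when there is no hit (assumption).
    """
    vector = []
    for q in query_proteins:
        best_score, best_identity = 0.0, 0.0
        for p in proteome:
            score, identity = hits.get((p, q), (0.0, 0.0))
            if score > best_score:
                best_score, best_identity = score, identity
        vector.append(best_identity)
    return vector

# Hypothetical precomputed alignment results.
hits = {
    ("a1", "q1"): (310.0, 82.0),
    ("a2", "q1"): (95.0, 35.5),
    ("a2", "q2"): (150.0, 61.0),
}
X = similarity_vector(["a1", "a2"], ["q1", "q2", "q3"], hits)
# X is [82.0, 61.0, 0.0]
```

The same routine would apply to both training phages and the input phage, since each is reduced to a vector of length M over the same query set Q.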

2.1.4 Testing set

The proteins of the input phage proteome are aligned against each protein in Q using the FASTA35 program. For each query protein P_j, the percent identity S_j corresponding to the highest scoring pair between P_j and any protein in the input phage's proteome is calculated and inserted into a vector X. A single similarity vector X = (S_1, S_2, …, S_M) is thus assembled for the input phage's proteome. This vector becomes the testing set, and the Random Forest ensemble classifier is used to predict the lifestyle.

2.1.5 Random Forest

To classify the testing set, PHACTS utilizes the Random Forest algorithm. In the Random Forest classifier, a set of decision trees is created. For each tree, bootstrapping is performed by selecting N cases with replacement from the training set of N cases. Each tree is grown by randomly selecting m variables at each node, where m is equal to the square root of the total number of variables. The best split at that node is calculated from these m variables, and the tree is grown to the largest extent possible. Each tree predicts a lifestyle, and the final prediction is determined by a majority vote of the trees in the Random Forest. Since adding more trees does not cause a Random Forest to overfit the data, large numbers of trees can be created. For our predictions, 1001 trees were created to provide enough coverage of the variable training set. For each lifestyle, Random Forest also outputs a value in the form of a probability, which corresponds to the fraction of trees that predicted that particular lifestyle; thus the values vary from 0 to 1. The lifestyle with the higher probability is considered to be the predicted lifestyle for that phage.
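The bootstrap and voting steps can be illustrated with a short sketch. It does not grow decision trees; it only shows how N cases are resampled with replacement and how per-tree predictions are turned into vote fractions. The function names, the rounding of the square root and the example vote counts are assumptions chosen for illustration.

```python
import math
import random
from collections import Counter

def bootstrap_sample(cases, rng):
    """Draw len(cases) training cases with replacement, as done for each tree."""
    return rng.choices(cases, k=len(cases))

def forest_vote(tree_predictions):
    """Return the majority label and the fraction of trees voting for each label."""
    counts = Counter(tree_predictions)
    fractions = {label: n / len(tree_predictions) for label, n in counts.items()}
    predicted = max(fractions, key=fractions.get)
    return predicted, fractions

# With M = 600 query proteins there are 600 variables; rounding the square
# root down (an assumption about rounding) gives the variables tried per node.
m_try = math.isqrt(600)

rng = random.Random(0)
sample = bootstrap_sample(list(range(100)), rng)

# Hypothetical votes from 1001 trees.
votes = ["temperate"] * 700 + ["virulent"] * 301
label, fractions = forest_vote(votes)
```

With two classes, an odd number of trees such as 1001 cannot produce a tied vote.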

2.1.6 Replicate iterations

The resulting prediction from a single Random Forest calculation is based on N known phages, which are randomly selected as training cases, and M proteins, which are randomly chosen to create the similarity vectors. Because of this random selection of training data, an unknown phage might be predicted as a different lifestyle in each subsequent Random Forest classification. To better account for this variability in predictions, 10 replicates are performed, each with different training phages and a different set of query proteins. Ten replicates were chosen to balance runtime and accuracy. Predictions based on five replicates were less accurate, whereas predictions based on 20 replicates caused runtimes to greatly increase without a concomitant increase in accuracy. The 10 replicate predictions are averaged, and the lifestyle with the higher average is considered the predicted lifestyle of the phage. For some phages, the replicate predictions might vary, with some replicates voting for one lifestyle and some voting for the other. The distribution of these predictions was found to be normal. The final probability score is considered ‘confident’ if a consensus of the 10 replicate predictions is for one particular lifestyle. To determine whether a prediction was confident, the mean and the SD of the 10 replicate predictions are calculated. The prediction is deemed ‘confident’ if the averaged probability score of the predicted lifestyle is at least 2 SD away from the averaged probability score of the other lifestyle.
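One possible reading of the confidence rule is sketched below. It assumes a two-class problem in which each replicate reports the fraction of trees voting temperate, so the virulent fraction is its complement and both share the same SD. The use of the sample SD and the `>=` comparison are assumptions, since the text does not specify them, and the replicate values are invented.

```python
from statistics import mean, stdev

def consensus(temperate_probs):
    """Average replicate probabilities and apply the 2 SD confidence rule.

    temperate_probs: per-replicate fraction of trees voting temperate.
    """
    virulent_probs = [1.0 - p for p in temperate_probs]
    mean_t, mean_v = mean(temperate_probs), mean(virulent_probs)
    sd = stdev(temperate_probs)  # sample SD (assumption); equal for both classes
    predicted = "temperate" if mean_t > mean_v else "virulent"
    confident = abs(mean_t - mean_v) >= 2 * sd
    return predicted, confident

# Hypothetical replicate outputs for two phages.
stable = [0.90, 0.85, 0.88, 0.92, 0.87, 0.90, 0.91, 0.86, 0.89, 0.90]
variable = [0.30, 0.70, 0.45, 0.60, 0.55, 0.40, 0.65, 0.35, 0.50, 0.52]

stable_call = consensus(stable)
variable_call = consensus(variable)
```

In this toy case the first phage is called temperate with confidence, while the second has averaged scores too close together, relative to their spread, to be called confidently.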

2.1.7 Initialization

Not all proteins are useful in identifying the class of a phage. To increase the accuracy of predictions, an importance cutoff value was incorporated so that only proteins that are important toward predicting a phage's class are included in the set of query proteins Q. A similarity vector is created for each temperate and virulent phage. This set of similarity vectors is used by the Random Forest algorithm to calculate the Gini importance values (also known as the Gini-coefficient) for all the proteins in the database that belong to phages with an annotated lifestyle. The Gini importance value is a measure of how important a protein is toward classifying a phage's lifestyle (Gini, 1912). A Gini value of zero corresponds to perfect equality (unimportant) and a value of one corresponds to perfect inequality (important). This step is only performed when new phages, and thus new proteins, are added to the database. The importance value for a protein is used during runtime so that only the most important proteins are selected to create the similarity vectors. To empirically determine which proteins to include in PHACTS calculations, the importance cutoff value was set to various percentages at and above the mean, and the 227 phages were classified using each of these cutoff values. It was found that an importance cutoff value of twice the mean of the importance values gave the best results for our dataset, and that by excluding less important proteins both the speed and the accuracy increased. To speed up runtime, the percent identity scores of every protein to every other protein are calculated at initialization by the FASTA program, and the results are stored in a data structure for optimized retrieval.
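The importance cutoff can be illustrated with a small sketch. The importance values below are invented, the function name `important_proteins` is hypothetical, and treating a protein exactly at the cutoff as included is an assumption.

```python
from statistics import mean

def important_proteins(importance, factor=2.0):
    """Keep proteins whose importance is at least factor times the mean importance."""
    cutoff = factor * mean(importance.values())
    return {protein for protein, value in importance.items() if value >= cutoff}

# Hypothetical Gini importance values for six proteins.
gini = {
    "integrase": 0.40,
    "repressor": 0.30,
    "capsid": 0.05,
    "hyp1": 0.02,
    "hyp2": 0.03,
    "hyp3": 0.01,
}
pool = important_proteins(gini)
# The mean is 0.135, so the cutoff is 0.27 and two proteins remain in the pool.
```

Only proteins in the resulting pool would then be eligible for random selection as query proteins.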

2.2 Partial genomes

Datasets were created that consisted of partial proteomes of various sizes. The first dataset contained 1000 partial proteomes that consisted of a single protein. Six more datasets were created by increasing the size of the partial proteomes in increments of five proteins until the final partial proteome dataset consisted of 1000 partial proteomes of 30 proteins. Testing partial proteomes >30 proteins causes a bias, since phages with small genomes become excluded. Each proteome was created by randomly choosing with replacement a phage with a known lifestyle and then randomly selecting a set of contiguous proteins in that phage. The partial proteomes were then used by PHACTS to predict the lifestyle of the phage. Accuracy scores were calculated by dividing the number of confident correct predictions by the total number of confident predictions.
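A minimal sketch of the partial-proteome sampling and the accuracy score follows. The proteome data are toy values, the function names are hypothetical, and excluding phages with fewer proteins than the requested size is an assumption based on the remark above about small genomes.

```python
import random

def sample_partial_proteome(proteomes, size, rng):
    """Pick a phage at random, then a run of `size` contiguous proteins from it."""
    # Assumption: phages with fewer than `size` proteins are not eligible.
    eligible = [name for name, proteins in proteomes.items() if len(proteins) >= size]
    name = rng.choice(eligible)
    proteins = proteomes[name]
    start = rng.randrange(len(proteins) - size + 1)
    return name, proteins[start:start + size]

def confident_accuracy(results):
    """results: iterable of (is_confident, is_correct) pairs, one per prediction."""
    confident = [is_correct for is_confident, is_correct in results if is_confident]
    return sum(confident) / len(confident) if confident else float("nan")

# Toy proteomes listed in genome order.
proteomes = {
    "A": [f"A{i}" for i in range(1, 9)],
    "B": ["B1", "B2"],
}
rng = random.Random(3)
name, part = sample_partial_proteome(proteomes, 5, rng)

score = confident_accuracy([(True, True), (True, False), (False, True), (True, True)])
```

Repeated calls to `sample_partial_proteome` can return the same phage more than once, which corresponds to choosing phages with replacement.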

3 RESULTS

3.1 Accuracy of the lifestyle predictions of PHACTS

To test the efficacy of PHACTS toward classifying a phage's lifestyle, each phage with an annotated lifestyle was sequentially removed from the known database, along with any phages that share >90% of their proteins at >90% identity, and PHACTS was used to predict its lifestyle. The predicted lifestyle was compared with the actual annotated lifestyle. Out of the 227 phages with a known lifestyle, PHACTS was able to confidently predict the lifestyle of 199 phages (Fig. 1). The other 28 phages gave variable results, with replicates sometimes being classified as virulent and other times as temperate. Out of the 199 predictions that were confident, 197 were correct, giving PHACTS a precision rate of 99% and a sensitivity of 88% for predicting the lifestyle of a phage. The results for each phage prediction, along with the SD, are listed in Supplementary Table S1.
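The reported rates can be reproduced from the counts above. The sketch assumes that precision is the fraction of confident predictions that were correct and that sensitivity is the fraction of all phages that received a confident prediction; this reading matches the reported 99% and 88%, but it is an interpretation rather than a definition given in the text.

```python
total_phages = 227
confident_predictions = 199
correct_confident = 197

# Assumed definitions, chosen because they match the reported figures.
precision = correct_confident / confident_predictions
sensitivity = confident_predictions / total_phages

precision_pct = round(precision * 100)
sensitivity_pct = round(sensitivity * 100)
```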

Fig. 1.

Accuracy of PHACTS predictions when classifying the lifestyle of the 227 phages with known lifestyles. A confident classification is where the averaged replicate predictions are >2 SD apart.

The two phages that were consistently classified incorrectly were the Mycobacteriophage D29 (28369.1) and the Lactococcal bacteriophage ul36 (114416.1). To find out the reason for the incorrect predictions of D29 and ul36, the genomes of these virulent double-stranded DNA phages were analyzed. Both phages contain an integrase gene, and both of these integrases are indeed functional (Peña et al., 1998; Labrie and Moineau, 2002). The fact that a virulent phage contains a functional integrase runs counter to the current idea that only temperate phages contain integrase. In the case of the Mycobacteriophage D29, a truncated repressor gene that is necessary for temperate proliferation is the cause of the strictly virulent lifestyle (Peña et al., 1998), whereas horizontal gene transfer seems to be responsible for the presence of the integrase in the Lactococcal bacteriophage ul36 (Labrie and Moineau, 2002). The reason that the lifestyle of 28 phages could not be predicted confidently was not as straightforward, but it most likely arises from a query phage having low similarity to phages with known lifestyles in the database. To determine how the function of a protein correlated with the importance that protein had on a prediction, the functional role of every protein in the query protein selection pool was retrieved from the PHANTOME website (www.phantome.org). Proteins were grouped according to lifestyle, and for each functional role a percent importance value was calculated by summing the Gini importance scores for proteins in that functional role and dividing by the total number of proteins in all functional roles (Fig. 2).
Even though a large percentage of the proteins have unknown function, it is clearly visible that Integration/Excision/Lysogeny, Regulation of Expression and Toxins genes are predominantly important toward classifying temperate phages, whereas Nucleotide Metabolism, Phage Lysis and Structural Proteins are predominantly important toward classifying virulent phages. The fact that Structural Proteins are one of the most important functional roles for classifying both temperate and virulent phages shows that by utilizing sequence similarity, PHACTS is able to distinguish between temperate phage proteins and virulent phage proteins even if they share similar functions. These important proteins were compared with the evolutionarily conserved modules found by Lima-Mendez et al. (2011) to be associated with a specific lifestyle, and the same correlation between module 1 and virulent phages, and module 17 and temperate phages, was observed (Supplementary Fig. S1).

Fig. 2.

The correlation between the protein function and the importance toward lifestyle predictions. Phage functional modules are proteins that have functions that are unique to phages, such as capsid assembly or phage DNA packaging.

3.2 Classification of partial genomes

PHACTS has been shown to be highly accurate for classifying the lifestyle of complete phage genomes. However, often only partial genomes are sequenced. To determine how accurate PHACTS predictions are when incomplete proteomes are used, lifestyle predictions were made for phages using only partial proteomes. It was found that with only 20 proteins, PHACTS can identify the lifestyle of a phage with a precision rate of ~90% (Fig. 3). The median number of proteins per phage genome in the database was 57, which suggests that at least a third of a phage's proteome is needed to accurately predict the lifestyle of a phage.

Fig. 3.

The effect of incomplete phage proteomes on the accuracy of PHACTS lifestyle predictions.

3.3 Classification of unknown phages

The lifestyle of each phage in the database that did not have an annotated lifestyle was predicted by PHACTS using the same methodology as above, but without excluding any phages from the training set (Supplementary Table S1). Out of the 417 phages, PHACTS was able to confidently predict the lifestyles of 217 phages, giving this dataset a specificity of <51%. This drop in specificity suggests that these phages without an annotated lifestyle are more diverse than the subset of phages with a known lifestyle. Also of note was the fact that the ratio of phages predicted temperate to phages predicted virulent in this dataset was ~1:1, which differs from the ratio of 2:1 observed in the set of phages with annotated lifestyles.

4 CONCLUSIONS AND FUTURE WORK

PHACTS provides a mechanism to determine the lifestyle of a phage without having to perform costly and time-consuming experimental lab techniques. PHACTS predictions were shown to have a 99% precision rate, and PHACTS can also determine the lifestyle of a phage using only genomic data, which previously could not be done. One of the current limitations of PHACTS is that for a small percentage of phages, a confident lifestyle prediction cannot be made. This is primarily caused by the variability that arises from the random sampling during classifications. If an unknown phage does not have any similarity to phages with known lifestyles in the database, predictions will be less certain. It is expected that as more phages with known lifestyles are added to the database, the precision rate and sensitivity of predictions will increase. The web version is simple and easy to use, and the stand-alone version allows for user customization and alternate training sets. The application of PHACTS to different classification schemes (Gram stain of host and phage Family) has been shown to be moderately successful (data not shown). In the future, refinements to the methodology may lead to high precision rates when classifying the Gram stain of host and phylogenetic Family of phages, as well as other novel classification schemes.

REFERENCES

1. K E Wommack; R R Colwell. Virioplankton: viruses in aquatic ecosystems. Microbiol Mol Biol Rev, 2000-03.
2. Forest Rohwer. Global phage diversity. Cell, 2003-04-18.
3. Forest Rohwer; Rob Edwards. The Phage Proteomic Tree: a genome-based taxonomy for phage. J Bacteriol, 2002-08.
4. R W Hendrix; M C Smith; R N Burns; M E Ford; G F Hatfull. Evolutionary relationships among diverse bacteriophages and prophages: all the world's a phage. Proc Natl Acad Sci U S A, 1999-03-02.
5. C E Peña; J Stoner; G F Hatfull. Mycobacteriophage D29 integrase-mediated recombination: specificity of mycobacteriophage integration. Gene, 1998-12-28.
6. W R Pearson; D J Lipman. Improved tools for biological sequence comparison. Proc Natl Acad Sci U S A, 1988-04.
7. Gipsi Lima-Mendez; Ariane Toussaint; Raphael Leplae. A modular view of the bacteriophage genomic space: identification of host and lifestyle marker modules. Res Microbiol, 2011-06-28.
8. E M Witkin. Ultraviolet mutagenesis and inducible DNA repair in Escherichia coli. Bacteriol Rev, 1976-12.
9. Caroline Proux; Douwe van Sinderen; Juan Suarez; Pilar Garcia; Victor Ladero; Gerald F Fitzgerald; Frank Desiere; Harald Brüssow. The dilemma of phage taxonomy illustrated by comparative genomics of Sfi21-like Siphoviridae in lactic acid bacteria. J Bacteriol, 2002-11.
10. W B Whitman; D C Coleman; W J Wiebe. Prokaryotes: the unseen majority. Proc Natl Acad Sci U S A, 1998-06-09.
