Vito Adrian Cantu, Peter Salamon, Victor Seguritan, Jackson Redfield, David Salamon, Robert A Edwards, Anca M Segall.
Abstract
For any given bacteriophage genome or phage-derived sequences in metagenomic data sets, we are unable to assign a function to 50-90% of genes, or more. Structural protein-encoding genes constitute a large fraction of the average phage genome and are among the most divergent and difficult-to-identify genes using homology-based methods. To understand the functions encoded by phages, their contributions to their environments, and to help gauge their utility as potential phage therapy agents, we have developed a new approach to classify phage ORFs into ten major classes of structural proteins or into an "other" category. The resulting tool is named PhANNs (Phage Artificial Neural Networks). We built a database of 538,213 manually curated phage protein sequences that we split into eleven subsets (ten for cross-validation, one for testing) using a novel clustering method that ensures there are no homologous proteins between sets yet maintains the maximum sequence diversity for training. An Artificial Neural Network ensemble trained on features extracted from those sets reached a test F1-score of 0.875 and a test accuracy of 86.2%. PhANNs can rapidly classify proteins into one of the ten structural classes or, if not predicted to fall in one of the ten classes, as "other," providing a new approach for functional annotation of phage proteins. PhANNs is open source and can be run from our web server or installed locally.
Year: 2020 PMID: 33137102 PMCID: PMC7660903 DOI: 10.1371/journal.pcbi.1007845
Source DB: PubMed Journal: PLoS Comput Biol ISSN: 1553-734X Impact factor: 4.475
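The abstract describes splitting the curated database into eleven subsets (ten for cross-validation, one for testing) so that no homologous proteins are shared between subsets. Below is a minimal sketch of one way to perform such a cluster-aware split, assuming that clusters of homologous sequences have already been computed by a separate sequence-clustering step; the function and variable names are hypothetical and this is not the authors' exact procedure.

```python
from collections import defaultdict

def assign_clusters_to_folds(cluster_members, n_folds=11):
    """Assign whole clusters of homologous proteins to folds so that no two
    folds share members of the same cluster (a greedy, size-balancing sketch).

    cluster_members: dict mapping a cluster id to a list of protein ids.
    Returns a dict mapping fold index (0..n_folds-1) to a list of protein ids.
    """
    folds = defaultdict(list)
    # Place the largest clusters first, always into the currently smallest fold,
    # so fold sizes stay roughly balanced while clusters are never split.
    for _cluster_id, members in sorted(
        cluster_members.items(), key=lambda kv: len(kv[1]), reverse=True
    ):
        smallest = min(range(n_folds), key=lambda f: len(folds[f]))
        folds[smallest].extend(members)
    return dict(folds)

# Toy example (real input would come from a sequence-clustering tool):
clusters = {"c1": ["p1", "p2", "p3"], "c2": ["p4"], "c3": ["p5", "p6"]}
print(assign_clusters_to_folds(clusters, n_folds=3))
```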
Results of per-class classification for the test set.
Support indicates the number of test sequences in each class. Accuracy (the fraction of observations correctly classified) is equivalent to the weighted-average recall (weighted by the support of each class); the macro average is unweighted (all classes contribute equally).
| Class | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| Major capsid | 0.80 | 0.91 | 0.85 | 2,456 |
| Minor capsid | 0.07 | 0.78 | 0.13 | 81 |
| Baseplate | 0.69 | 0.75 | 0.72 | 851 |
| Major tail | 0.55 | 0.79 | 0.65 | 502 |
| Minor tail | 0.66 | 0.82 | 0.73 | 1,072 |
| Portal | 0.81 | 0.81 | 0.81 | 5,261 |
| Tail fiber | 0.35 | 0.74 | 0.47 | 648 |
| Tail sheath | 0.97 | 0.93 | 0.95 | 2,031 |
| Collar | 0.51 | 0.86 | 0.64 | 300 |
| Head-Tail joining | 0.56 | 0.84 | 0.67 | 1,277 |
| Others | 0.96 | 0.86 | 0.91 | 32,322 |
| Macro average | 0.63 | 0.83 | 0.68 | 46,801 |
| Weighted average | 0.89 | 0.86 (accuracy) | 0.87 | 46,801 |
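As the caption above notes, accuracy is the support-weighted average of the per-class recalls, while the macro average weights every class equally. A small sketch illustrating the two averages (the numbers below are hypothetical):

```python
def macro_and_weighted_average(per_class_values, supports):
    """Macro average (unweighted mean) and weighted average (support-weighted mean).

    Applied to per-class recall, the weighted average equals the overall accuracy:
    sum over classes of recall_c * support_c / total = total correct / total.
    """
    macro = sum(per_class_values) / len(per_class_values)
    total = sum(supports)
    weighted = sum(v * s for v, s in zip(per_class_values, supports)) / total
    return macro, weighted

# Hypothetical three-class example:
recalls = [0.91, 0.78, 0.86]
supports = [2456, 81, 32322]
print(macro_and_weighted_average(recalls, supports))
```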
Summary of previous ML-based methods for classifying viral structural proteins.
| Reference | Method | Target proteins | Database size | Accuracy |
|---|---|---|---|---|
| Seguritan et al. | ANN | structural (all viruses) versus non-structural (all viruses) | 6,303 structural; 7,500 non-structural | 85.6% |
| Seguritan et al. | ANN | capsid versus non-capsid (phages only) | 757 capsid; 10,929 non-capsid | 91.3% |
| Seguritan et al. | ANN | tail-associated versus non-tail (phages only) | 2,174 tail; 16,881 non-tail | 79.9% |
| Feng et al. | Naïve Bayes | structural versus non-structural | 99 structural; 208 non-structural | 79.15% |
| Zhang et al. | Ensemble Random Forest | structural versus non-structural | 253 structural; 248 non-structural | 85.0% |
| Galiez et al. | SVM | capsid versus non-capsid | 3,888 capsid; 4,071 non-capsid | 96.8% |
| Galiez et al. | SVM | tail versus non-tail | 2,574 tail; 4,095 non-tail | 89.4% |
| Manavalan et al. | SVM | structural versus non-structural | 129 structural; 272 non-structural | 87.0% |
| This work | ANN | Ten distinct phage structural classes plus “others” | 168,660 structural; 369,553 non-structural | 86.2% |
Database numbers—Raw sequences were downloaded using a custom script available at https://github.com/Adrian-Cantu/PhANNs.
All datasets can be downloaded from the web server. *Numbers before and after removing sequences at least 60% identical to a protein in the classes database.
| Class | Raw sequences | After manual curation | After de-replication at 40% | After expansion and de-replication at 100% |
|---|---|---|---|---|
| Major capsid | 112,987 | 105,653 | 1,945 | 35,755 |
| Minor capsid | 2,901 | 1,903 | 261 | 1,055 |
| Baseplate | 75,599 | 19,293 | 401 | 6,221 |
| Major tail | 66,513 | 35,030 | 536 | 7,704 |
| Minor tail | 94,628 | 80,467 | 918 | 18,002 |
| Portal | 210,064 | 189,143 | 2,310 | 59,745 |
| Tail fiber | 29,132 | 18,514 | 1,222 | 7,256 |
| Tail sheath | 37,885 | 35,570 | 599 | 15,349 |
| Collar | 4,224 | 3,709 | 339 | 2,105 |
| Head-Tail joining | 60,270 | 58,658 | 1,317 | 15,468 |
| Others | 733,006 | 643,735/643,380* | 106,004 | 369,553 |
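The table above reports counts after de-replication at 40% and 100% identity. The 40% step requires a dedicated sequence-clustering tool and is not reproduced here; the sketch below only illustrates the simpler 100% case (removal of exact duplicate sequences) with Biopython, using hypothetical file names.

```python
from Bio import SeqIO  # Biopython

def dereplicate_at_100(in_path, out_path):
    """Remove exact duplicate sequences (de-replication at 100% identity).

    Keeps the first record seen for each unique amino-acid sequence. De-replication
    at lower identity (e.g., 40%) would instead require a clustering tool.
    """
    seen = set()
    kept = []
    for record in SeqIO.parse(in_path, "fasta"):
        seq = str(record.seq).upper()
        if seq not in seen:
            seen.add(seq)
            kept.append(record)
    return SeqIO.write(kept, out_path, "fasta")  # returns the number of records written

# Hypothetical file names for illustration:
# dereplicate_at_100("major_capsid_raw.fasta", "major_capsid_derep100.fasta")
```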
Feature types included in each of the 12 models.
di—2-mer/dipeptide composition; tri—3-mer/tripeptide composition; tetra—4-mer/tetrapeptide composition; sc—side-chain grouping; p—plus all the extra features [isoelectric point, instability index (whether a protein is likely to be degraded rapidly), ORF length, aromaticity (relative frequency of aromatic amino acids), molar extinction coefficient (how much light a protein absorbs), computed using two methods (assuming reduced cysteines or assuming disulfide bonds), hydrophobicity, GRAVY index (average hydropathy), and molecular weight], as computed using Biopython. *Per-class score figures are available as supplementary material.
| Model | di | tri | di_sc | tri_sc | tetra_sc | p |
|---|---|---|---|---|---|---|
| di_sc* | | | x | | | |
| di_sc_p* | | | x | | | x |
| tri_sc* | | | | x | | |
| tri_sc_p* | | | | x | | x |
| tetra_sc* | | | | | x | |
| tetra_sc_p* | | | | | x | x |
| di | x | | | | | |
| di_p | x | | | | | x |
| tri | | x | | | | |
| tri_p | | x | | | | x |
| tetra_sc_tri_p | | x | | | x | x |
| all | x | x | x | x | x | x |
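The legend above states that the extra "p" features are computed with Biopython. Below is a minimal sketch of how those physicochemical features and a 2-mer (dipeptide) composition vector could be derived with Biopython's ProtParam module; it is an illustration rather than the authors' exact feature-extraction code, and the side-chain grouping ("sc") alphabet is not shown.

```python
from itertools import product
from Bio.SeqUtils.ProtParam import ProteinAnalysis

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def extra_features(seq):
    """Biopython-based physicochemical features (the 'p' set in the legend)."""
    pa = ProteinAnalysis(seq)
    ext_reduced, ext_disulfide = pa.molar_extinction_coefficient()
    return {
        "isoelectric_point": pa.isoelectric_point(),
        "instability_index": pa.instability_index(),
        "orf_length": len(seq),
        "aromaticity": pa.aromaticity(),
        "extinction_reduced": ext_reduced,
        "extinction_disulfide": ext_disulfide,
        "gravy": pa.gravy(),  # GRAVY serves as the hydropathy measure in this sketch
        "molecular_weight": pa.molecular_weight(),
    }

def dipeptide_composition(seq):
    """Normalized 2-mer (dipeptide) frequencies: the 'di' feature set."""
    counts = {a + b: 0 for a, b in product(AMINO_ACIDS, repeat=2)}
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
    total = max(len(seq) - 1, 1)
    return {k: v / total for k, v in counts.items()}

# Example with a hypothetical sequence:
protein = "MKTAYIAKQR" * 10
features = {**dipeptide_composition(protein), **extra_features(protein)}
```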
Results of per-class classification for proteins in the test set with a PhANNs score of 8 or higher.
Support indicates the number of test sequences in each class. Accuracy (the fraction of observations correctly classified) is equivalent to the weighted-average recall (weighted by the support of each class); the macro average is unweighted (all classes contribute equally).
| Class | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| Major capsid | 0.99 | 0.99 | 0.99 | 1,563 |
| Minor capsid | 0.28 | 0.96 | 0.43 | 45 |
| Baseplate | 0.97 | 0.83 | 0.89 | 151 |
| Major tail | 0.95 | 0.97 | 0.96 | 307 |
| Minor tail | 0.95 | 0.99 | 0.97 | 625 |
| Portal | 0.99 | 0.94 | 0.97 | 3,810 |
| Tail fiber | 0.89 | 0.94 | 0.91 | 360 |
| Tail sheath | 1.00 | 1.00 | 1.00 | 1,495 |
| Collar | 0.82 | 1.00 | 0.90 | 98 |
| Head-Tail joining | 0.91 | 1.00 | 0.95 | 916 |
| Others | 0.99 | 0.99 | 0.99 | 18,223 |
| Macro average | 0.89 | 0.96 | 0.91 | 27,593 |
| Weighted average | 0.98 | 0.98 (accuracy) | 0.98 | 27,593 |
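The table above keeps only predictions with a PhANNs score of 8 or higher. Assuming, purely for illustration, that the score is the sum of the predicted-class probabilities from the ten cross-validation networks (so it ranges from 0 to 10), a confidence filter could look like the sketch below; this is not a statement of the authors' exact implementation.

```python
import numpy as np

def ensemble_predict(per_model_probs, threshold=8.0):
    """Combine per-model class probabilities and flag high-confidence calls.

    per_model_probs: array of shape (n_models, n_proteins, n_classes), where each
        row of the last axis sums to 1 (softmax output of one network).
    Returns (predicted_class, score, confident), where score is the summed
    probability of the winning class across models (0..n_models).
    """
    summed = per_model_probs.sum(axis=0)       # (n_proteins, n_classes), values in 0..n_models
    predicted_class = summed.argmax(axis=1)    # winning class per protein
    score = summed.max(axis=1)                 # ensemble score of the winning class
    confident = score >= threshold             # e.g., score >= 8 out of a possible 10
    return predicted_class, score, confident

# Toy example: 10 models, 2 proteins, 11 classes (10 structural + "others")
probs = np.random.dirichlet(np.ones(11), size=(10, 2))
cls, score, ok = ensemble_predict(probs, threshold=8.0)
```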
The effect on the model’s scores of excluding the minor capsid class (mc)—Most scores are affected only slightly and are as likely to improve as to worsen.
| Class | Precision | Precision (mc) | Recall | Recall (mc) | F1-score | F1-score (mc) | Support | ROC area | ROC area (mc) |
|---|---|---|---|---|---|---|---|---|---|
| Major capsid | 0.76 | 0.76 | 0.92 | 0.92 | 0.83 | 0.83 | 2,456 | 0.917 | 0.918 |
| Minor capsid | 0.08 | - | 0.77 | - | 0.15 | - | 81 (0) | 0.899 | - |
| Baseplate | 0.69 | 0.69 | 0.74 | 0.83 | 0.72 | 0.75 | 851 | 0.621 | 0.72 |
| Major tail | 0.56 | 0.53 | 0.77 | 0.80 | 0.65 | 0.64 | 502 | 0.918 | 0.91 |
| Minor tail | 0.75 | 0.70 | 0.82 | 0.81 | 0.78 | 0.75 | 1,070 | 0.939 | 0.94 |
| Portal | 0.83 | 0.80 | 0.81 | 0.85 | 0.82 | 0.82 | 5,261 | 0.943 | 0.945 |
| Tail fiber | 0.31 | 0.32 | 0.76 | 0.75 | 0.44 | 0.45 | 648 | 0.861 | 0.86 |
| Tail sheath | 0.96 | 0.95 | 0.94 | 0.93 | 0.95 | 0.94 | 2,031 | 0.986 | 0.957 |
| Collar | 0.61 | 0.53 | 0.84 | 0.80 | 0.70 | 0.63 | 300 | 0.865 | 0.85 |
| Head-Tail joining | 0.56 | 0.58 | 0.84 | 0.85 | 0.67 | 0.69 | 1,277 | 0.933 | 0.923 |
| Others | 0.96 | 0.96 | 0.87 | 0.88 | 0.91 | 0.92 | 33,402 | 0.838 | 0.838 |
| Macro average | 0.64 | 0.68 | 0.83 | 0.84 | 0.69 | 0.74 | 47,879 (47,798) | | |
| Weighted average | 0.90 | 0.90 | 0.86 | 0.87 | 0.88 | 0.88 | 47,879 (47,798) | | |
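The ROC-area columns report per-class areas under the ROC curve. A minimal sketch of computing a one-vs-rest ROC AUC per class with scikit-learn, shown as an illustration of the metric rather than the authors' exact procedure:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def per_class_roc_auc(y_true, y_score, n_classes):
    """One-vs-rest ROC AUC for each class.

    y_true: integer class labels, shape (n_samples,)
    y_score: predicted class probabilities/scores, shape (n_samples, n_classes)
    """
    aucs = {}
    for c in range(n_classes):
        is_c = (y_true == c).astype(int)      # binary "class c vs rest" labels
        if is_c.sum() in (0, len(is_c)):      # AUC undefined if only one label is present
            aucs[c] = float("nan")
            continue
        aucs[c] = roc_auc_score(is_c, y_score[:, c])
    return aucs

# Toy example with 3 classes:
y_true = np.array([0, 1, 2, 1, 0])
y_score = np.random.dirichlet(np.ones(3), size=5)
print(per_class_roc_auc(y_true, y_score, 3))
```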
Comparison of PhANNs with VIRALpro. Results from running the VIRALpro test sets through PhANNs and the PhANNs test set through VIRALpro's CAPSIDpro and TAILpro classifiers.
| Metric | PhANNs test set in TAILpro | TAILpro test set in PhANNs | PhANNs test set in CAPSIDpro | CAPSIDpro test set in PhANNs |
|---|---|---|---|---|
| test set size | 10,805 | 672 | 15,107 | 787 |
| precision | 0.28 | 0.77 | 0.14 | 0.82 |
| recall | 0.79 | 0.68 | 0.86 | 0.32 |
| accuracy | 0.80 | 0.82 | 0.70 | 0.67 |
| F1-score | 0.42 | 0.72 | 0.25 | 0.46 |
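The comparison above evaluates a multi-class tool against binary capsid and tail classifiers, so one side's predictions must be collapsed to a binary label before precision, recall, accuracy, and F1 can be computed. The sketch below shows one way to do this; the class grouping and function names are illustrative assumptions, not the mapping used by the authors.

```python
from sklearn.metrics import precision_score, recall_score, accuracy_score, f1_score

# Illustrative grouping: which multi-class labels count as "tail" for a binary comparison.
TAIL_CLASSES = {"Major tail", "Minor tail", "Tail fiber", "Tail sheath"}

def binary_metrics(true_labels, predicted_labels, positive_classes):
    """Collapse multi-class labels to positive/negative and compute the table's metrics."""
    y_true = [1 if t in positive_classes else 0 for t in true_labels]
    y_pred = [1 if p in positive_classes else 0 for p in predicted_labels]
    return {
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "accuracy": accuracy_score(y_true, y_pred),
        "F1-score": f1_score(y_true, y_pred),
    }

# Toy example:
truth = ["Major tail", "Portal", "Tail fiber", "Others"]
preds = ["Major tail", "Tail fiber", "Others", "Others"]
print(binary_metrics(truth, preds, TAIL_CLASSES))
```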