Mengge Zhang, Lianping Yang, Jie Ren, Nathan A Ahlgren, Jed A Fuhrman, Fengzhu Sun.
Abstract
BACKGROUND: The study of virus-host infectious association is important for understanding the functions and dynamics of microbial communities. Both cellular and fractionated viral metagenomic data generate a large number of viral contigs with missing host information. Although relatively simple methods based on the similarity between the word frequency vectors of viruses and bacterial hosts have been developed to study virus-host associations, the problem remains significantly understudied. We hypothesize that machine learning methods based on word frequencies can be used efficiently to study virus-host infectious associations.
Year: 2017 PMID: 28361670 PMCID: PMC5374558 DOI: 10.1186/s12859-017-1473-7
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Description of the training and testing data
| Bacterial genus | Cutting year | # of viruses before cutting year | # of viruses after the cutting year | # of non-infectious viruses |
|---|---|---|---|---|
| Bacillus | 2012 | 31 | 31 | 1364 |
| Escherichia | 2012 | 141 | 32 | 1253 |
| Lactococcus | 2013 | 49 | 6 | 1371 |
| Mycobacterium | 2013 | 172 | 46 | 1208 |
| Pseudomonas | 2013 | 68 | 28 | 1330 |
| Salmonella | 2012 | 32 | 22 | 1372 |
| Staphylococcus | 2012 | 43 | 20 | 1363 |
| Synechococcus | 2012 | 30 | 17 | 1379 |
| Vibrio | 2012 | 39 | 29 | 1358 |
For a specific year, the positive training data set contains viruses infecting the corresponding host identified before the specific year and the positive testing data set contains viruses infecting the corresponding host discovered after the specific year. The negative training data and the negative testing data were chosen randomly without overlaps from the viruses that were not identified to infect the host
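The split described above can be sketched as follows. This is a minimal illustration, not the authors' code: the `Virus` record, the `temporal_split` function, the choice to place viruses from the cutting year itself in the test set, and the choice to draw as many negatives as positives are all assumptions made for the example.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Virus:
    name: str
    host: str  # host genus the virus is known to infect
    year: int  # year the virus was identified


def temporal_split(viruses, host, cutting_year, seed=0):
    """Split viruses into positive/negative training and testing sets."""
    infecting = [v for v in viruses if v.host == host]
    others = [v for v in viruses if v.host != host]

    # Positives are separated by discovery year (boundary handling is assumed).
    pos_train = [v for v in infecting if v.year < cutting_year]
    pos_test = [v for v in infecting if v.year >= cutting_year]

    # Negatives are drawn at random, without overlap between the two sets.
    rng = random.Random(seed)
    shuffled = others[:]
    rng.shuffle(shuffled)
    n_train = min(len(pos_train), len(shuffled))
    neg_train = shuffled[:n_train]
    neg_test = shuffled[n_train:n_train + len(pos_test)]

    return {
        "pos_train": pos_train,
        "pos_test": pos_test,
        "neg_train": neg_train,
        "neg_test": neg_test,
    }
```

Because the negative sets are taken from disjoint slices of one shuffled list, no virus can appear in both the negative training and negative testing data.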
Fig. 1 The average AUC scores of the different machine learning methods and features. Each averaged AUC score is the mean of the AUC scores across different hosts and different genome background distributions. The panels from top to bottom show the performance for word lengths k=4, k=6 and k=8. The black segments on top of the bars show the standard deviation of the AUC scores across different hosts and different genome background distributions
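For readers unfamiliar with the summary plotted in the figure, the sketch below computes an AUC from prediction scores and then the mean and standard deviation over several AUC values. The function names are hypothetical, and the use of the sample standard deviation is an assumption; the caption does not say which estimator was used.

```python
import statistics


def auc(pos_scores, neg_scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def summarize(auc_values):
    """Mean and sample standard deviation of a list of AUC scores."""
    return statistics.mean(auc_values), statistics.stdev(auc_values)
```

The pairwise definition is quadratic in the number of scores, which is acceptable for data sets of the size listed in the table of training and testing data.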
The AUC scores of using RF combined with the first feature with different word pattern lengths across 9 different hosts
| Host | k=4 | k=6 | k=8 |
|---|---|---|---|
| Bacillus | 0.856 | | 0.823 |
| Escherichia | | 0.858 | 0.807 |
| Lactococcus | 0.972 | | 0.988 |
| Mycobacterium | | 0.985 | 0.984 |
| Pseudomonas | 0.978 | | 0.967 |
| Salmonella | 0.889 | | 0.891 |
| Staphylococcus | | 0.987 | 0.983 |
| Synechococcus | 0.965 | | 0.955 |
| Vibrio | 0.936 | | 0.892 |
The highest score for each host is highlighted in bold
Average Manhattan distances of word relative frequency vectors between pairs of viruses infecting each host
| Host | k=4 | k=6 | k=8 |
|---|---|---|---|
| Bacillus | 0.342 | 0.558 | 1.146 |
| Escherichia | 0.372 | 0.706 | 1.416 |
| Lactococcus | 0.194 | 0.417 | 1.020 |
| Mycobacterium | 0.292 | 0.513 | 1.044 |
| Pseudomonas | 0.379 | 0.633 | 1.241 |
| Salmonella | 0.324 | 0.568 | 1.249 |
| Staphylococcus | 0.266 | 0.455 | 0.984 |
| Synechococcus | 0.335 | 0.516 | 0.986 |
| Vibrio | 0.371 | 0.658 | 1.360 |
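The quantity tabulated above can be illustrated with a short sketch: build the relative frequency vector of all length-k words over the alphabet A, C, G, T, then average the Manhattan (L1) distance over all pairs of sequences. The function names are hypothetical, and counting words on a single strand only (no reverse complement) is an assumption, since the table caption does not specify it.

```python
from collections import Counter
from itertools import combinations, product


def kmer_frequencies(seq, k):
    """Relative frequencies of all 4**k words of length k, in lexicographic order."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    # Words containing other symbols (e.g. N) are ignored in the total.
    total = sum(counts[w] for w in words)
    if total == 0:
        return [0.0] * len(words)
    return [counts[w] / total for w in words]


def manhattan(u, v):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def mean_pairwise_distance(seqs, k):
    """Average Manhattan distance over all unordered pairs of sequences."""
    vectors = [kmer_frequencies(s, k) for s in seqs]
    pairs = list(combinations(vectors, 2))
    return sum(manhattan(u, v) for u, v in pairs) / len(pairs)
```

Since each vector sums to 1, the distance between two such vectors lies between 0 and 2, which is consistent with the range of values in the table.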
The distribution of viruses among three major viral families, Myoviridae, Podoviridae and Siphoviridae, for viruses infecting each of the nine host genera
| Host | Myoviruses | Podoviruses | Siphoviruses | Other | Entropy |
|---|---|---|---|---|---|
| Bacillus | 21 | 8 | 27 | 3 | 1.656 |
| Escherichia | 49 | 30 | 43 | 51 | 1.972 |
| Lactococcus | 0 | 2 | 53 | 0 | 0.472 |
| Mycobacterium | 10 | 0 | 206 | 2 | 0.384 |
| Pseudomonas | 38 | 29 | 21 | 8 | 1.829 |
| Salmonella | 10 | 19 | 21 | 4 | 1.789 |
| Staphylococcus | 19 | 4 | 35 | 5 | 1.535 |
| Synechococcus | 28 | 6 | 5 | 8 | 1.603 |
| Vibrio | 20 | 21 | 6 | 21 | 1.875 |
The last column is the entropy of the distribution
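As an illustration of the entropy column, the sketch below computes the Shannon entropy of a vector of counts. The base-2 logarithm is an assumption (the source does not state the base); with it, the Bacillus counts 21, 8, 27 and 3 give approximately 1.656, the value listed for that row. The function name is hypothetical.

```python
import math


def shannon_entropy(counts):
    """Shannon entropy in bits of the distribution given by non-negative counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    # Categories with zero count contribute nothing to the sum.
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


bacillus = shannon_entropy([21, 8, 27, 3])
```

A low entropy means the viruses of a host are concentrated in one family, and a value near 2 means they are spread almost evenly over the four categories.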
Comparison of AUC scores of RF (random forest) combined with word frequency vector with that based on Manhattan distance and statistic when k=6
| Host | Manhattan | Statistic (i.i.d.) | Statistic (order 1) | Statistic (order 2) | RF-feat-1 |
|---|---|---|---|---|---|
| Bacillus | 0.829 | 0.752 | 0.873 | 0.851 | 0.863 |
| Escherichia | 0.880 | 0.833 | 0.958 | 0.945 | 0.856 |
| Lactococcus | 0.767 | 0.775 | 0.828 | 0.750 | 1.000 |
| Mycobacterium | 0.976 | 0.977 | 0.966 | 0.984 | 0.985 |
| Pseudomonas | 0.951 | 0.934 | 0.974 | 0.970 | 0.981 |
| Salmonella | 0.837 | 0.818 | 0.900 | 0.900 | 0.896 |
| Staphylococcus | 0.964 | 0.941 | 0.947 | 0.974 | 0.987 |
| Synechococcus | 0.929 | 0.906 | 0.994 | 0.993 | 0.978 |
| Vibrio | 0.841 | 0.733 | 0.854 | 0.817 | 0.940 |
For the background model of the statistic, we considered the independent identically distributed (i.i.d.) model and first- and second-order Markov chains
Fig. 2 Scheme of the RF method for the prediction of hosts of viral contigs of different lengths. For each of the 9 main host genera, we produced 4 training datasets with different sequence lengths by breaking the whole viral genomes into nonoverlapping contigs of lengths 1, 3, 5 kbps and the whole genomes with 0.05% sequencing errors added. Similarly, we also generated 4 different testing datasets with different contig lengths. We then evaluated the performances of RF for each training dataset with specific sequence length on all 4 testing datasets
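The data preparation in this scheme might look like the following sketch. It is not the authors' pipeline: the function names are hypothetical, the trailing fragment shorter than the contig length is dropped by assumption, and sequencing errors are modeled as independent per-base substitutions at a rate of 0.0005 (0.05%), which is one possible reading of the caption.

```python
import random


def split_into_contigs(genome, length):
    """Break a genome into nonoverlapping contigs of exactly `length` bases."""
    # Assumption: a trailing piece shorter than `length` is discarded.
    return [genome[i:i + length] for i in range(0, len(genome) - length + 1, length)]


def add_substitution_errors(seq, rate=0.0005, seed=0):
    """Replace each A/C/G/T base by a different base with probability `rate`."""
    rng = random.Random(seed)
    out = []
    for base in seq:
        if base in "ACGT" and rng.random() < rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


def make_datasets(genome, lengths=(1000, 3000, 5000), rate=0.0005, seed=0):
    """Return contigs per length, plus the whole genome, with errors added."""
    noisy = add_substitution_errors(genome, rate, seed)
    datasets = {length: split_into_contigs(noisy, length) for length in lengths}
    datasets["wgs"] = [noisy]
    return datasets
```

Applying `make_datasets` separately to the training and testing genomes yields the 4 by 4 grid of training and testing combinations described in the caption.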
The AUC scores of the RF method for the prediction of hosts of viral contigs with different lengths using the models built from contigs of different lengths
| Testing | Training | Baci. | Esch. | Lact. | Myco. | Pseu. | Salm. | Stap. | Syne. | Vibr. |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 kb | 1 kb | 0.773 | 0.805 | 0.840 | 0.962 | 0.924 | 0.812 | 0.936 | 0.928 | 0.842 |
| 1 kb | 3 kb | 0.819 | 0.857 | 0.833 | 0.977 | 0.959 | 0.818 | 0.955 | 0.960 | 0.858 |
| 1 kb | 5 kb | 0.831 | 0.848 | 0.848 | 0.977 | 0.952 | 0.826 | 0.957 | 0.957 | 0.845 |
| 1 kb | wgs | 0.821 | 0.718 | 0.818 | 0.886 | 0.833 | 0.792 | 0.948 | 0.890 | 0.774 |
| 3 kb | 1 kb | 0.766 | 0.862 | 0.842 | 0.979 | 0.947 | 0.843 | 0.961 | 0.961 | 0.880 |
| 3 kb | 3 kb | 0.823 | 0.878 | 0.868 | 0.980 | 0.975 | 0.866 | 0.966 | 0.967 | 0.898 |
| 3 kb | 5 kb | 0.850 | 0.899 | 0.889 | 0.985 | 0.978 | 0.880 | 0.976 | 0.983 | 0.917 |
| 3 kb | wgs | 0.854 | 0.827 | 0.885 | 0.967 | 0.952 | 0.872 | 0.976 | 0.951 | 0.876 |
| 5 kb | 1 kb | 0.768 | 0.870 | 0.822 | 0.982 | 0.955 | 0.838 | 0.964 | 0.972 | 0.867 |
| 5 kb | 3 kb | 0.827 | 0.900 | 0.869 | 0.985 | 0.977 | 0.871 | 0.974 | 0.986 | 0.904 |
| 5 kb | 5 kb | 0.852 | 0.890 | 0.883 | 0.986 | 0.978 | 0.883 | 0.972 | 0.972 | 0.907 |
| 5 kb | wgs | 0.858 | 0.865 | 0.888 | 0.983 | 0.970 | 0.882 | 0.979 | 0.965 | 0.900 |
| wgs | 1 kb | 0.778 | 0.860 | 0.814 | 0.984 | 0.960 | 0.812 | 0.971 | 0.981 | 0.896 |
| wgs | 3 kb | 0.854 | 0.901 | 0.817 | 0.994 | 0.988 | 0.884 | 0.989 | 0.994 | 0.930 |
| wgs | 5 kb | 0.870 | 0.923 | 0.861 | 0.994 | 0.992 | 0.889 | 0.992 | 0.996 | 0.934 |
| wgs | wgs | 0.862 | 0.859 | 1.0 | 0.985 | 0.979 | 0.893 | 0.987 | 0.981 | 0.938 |
Fig. 3 The histograms of the RF scores for the viral sequences in the negative and positive data sets, and a T4-like, c non-T4-like, and e viral contigs, respectively. The corresponding fitted density functions are given as b, d, and f, respectively. In all of the 6 subfigures, the horizontal axis is the prediction scores. In a, c and e, the right y-axis indicates the fraction and the left y-axis indicates the fraction divided by the bin-size