Duleepa Jayasundara1, Damayanthi Herath2,3, Damith Senanayake2, Isaam Saeed2, Cheng-Yu Yang4, Yuan Sun2, Bill C Chang5, Sen-Lin Tang6, Saman K Halgamuge2,7.
Abstract
BACKGROUND: Metagenomic approaches make it possible to estimate the parameters that describe the ecology of viruses, particularly novel ones. However, the best-performing existing methods require databases to first estimate the average genome length of a viral community before they can estimate other parameters, such as viral richness. Although this approach has been widely used, it can adversely skew results, since the majority of viruses are yet to be catalogued in databases.
Keywords: Average genome length; Richness estimation; Viral metagenomics
Year: 2019 PMID: 30717665 PMCID: PMC7394321 DOI: 10.1186/s12859-018-2398-5
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Performance of ENVirT in comparison to standard GA algorithm on simulated contig spectra
| Input L0 (bp) | Input M0 | Input T | Input d | Evenness | f | ENVirT L (bp) | ENVirT M | ENVirT T | ENVirT d | ENVirT S | GA L (bp) | GA M | GA T | GA d | GA S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12500 | 300 | exp | 0.030 | 0.790 | 2.956% | 12500 | 300 | exp | 0.030 | 0.00×10^0 | 39500 | 12400 | exp | 0.095 | 3.49×10^-2 |
| 12500 | 1000 | log | 0.900 | 0.995 | 0.661% | 14972 | 838 | log | 0.893 | 6.56×10^-3 | 310000 | 100 | lgn | 1.063 | 2.59×10^1 |
| 12500 | 5000 | lgn | 2.500 | 0.655 | 11.849% | 12500 | 5000 | lgn | 2.500 | 0.00×10^0 | 12500 | 5000 | lgn | 2.500 | 0.00×10^0 |
| 12500 | 10000 | pl | 0.700 | 0.913 | 1.997% | 12500 | 10000 | pl | 0.700 | 0.00×10^0 | 29500 | 1400 | log | 1.911 | 6.38×10^0 |
| 50000 | 300 | exp | 0.030 | 0.790 | 2.956% | 50000 | 300 | exp | 0.030 | 0.00×10^0 | 41000 | 100 | pl | 0.378 | 1.53×10^1 |
| 50000 | 1000 | log | 0.900 | 0.995 | 0.661% | 50000 | 1000 | log | 0.900 | 0.00×10^0 | 100500 | 600 | lgn | 0.531 | 3.48×10^-2 |
| 50000 | 5000 | lgn | 2.500 | 0.655 | 11.849% | 50000 | 5000 | lgn | 2.500 | 0.00×10^0 | 50000 | 5100 | lgn | 2.506 | 1.92×10^-2 |
| 50000 | 10000 | pl | 0.700 | 0.913 | 1.997% | 52787 | 10175 | pl | 0.707 | 1.72×10^-3 | 41000 | 9800 | pl | 0.677 | 2.22×10^-2 |
| 125000 | 300 | exp | 0.030 | 0.790 | 2.956% | 125000 | 300 | exp | 0.030 | 0.00×10^0 | 58500 | 11000 | exp | 0.014 | 2.70×10^-2 |
| 125000 | 1000 | log | 0.900 | 0.995 | 0.661% | 125000 | 1000 | log | 0.900 | 0.00×10^0 | 69000 | 1800 | log | 0.943 | 3.94×10^-4 |
| 125000 | 5000 | lgn | 2.500 | 0.655 | 11.849% | 125000 | 5000 | lgn | 2.500 | 0.00×10^0 | 125000 | 5000 | lgn | 2.500 | 0.00×10^0 |
| 125000 | 10000 | pl | 0.700 | 0.913 | 1.997% | 116341 | 9824 | pl | 0.691 | 1.96×10^-4 | 203000 | 15000 | lgn | 1.922 | 9.34×10^-1 |
| 300000 | 300 | exp | 0.030 | 0.790 | 2.956% | 300000 | 300 | exp | 0.030 | 0.00×10^0 | 67000 | 400 | lgn | 0.543 | 5.36×10^-2 |
| 300000 | 1000 | log | 0.900 | 0.995 | 0.661% | 217303 | 1373 | log | 0.899 | 1.26×10^-7 | 156000 | 1900 | log | 0.931 | 1.93×10^-5 |
| 300000 | 5000 | lgn | 2.500 | 0.655 | 11.849% | 300000 | 5000 | lgn | 2.500 | 0.00×10^0 | 310000 | 7400 | lgn | 2.635 | 1.09×10^-1 |
| 300000 | 10000 | pl | 0.700 | 0.913 | 1.997% | 277000 | 9800 | pl | 0.690 | 3.00×10^-5 | 77000 | 5600 | log | 1.658 | 2.97×10^-2 |
Contig spectra were generated with parameters R = 10000, r = 100 bp and o = 35 bp. pl = power-law distribution, exp = exponential distribution, log = logarithmic distribution and lgn = lognormal distribution. f = relative abundance of the dominant genotype. S = the value of the cost function corresponding to the estimated values of M, L, T and d. GA = Genetic Algorithm. For both ENVirT and GA without niching, the search was bounded to M from 1 to 15000, L from 10000 to 310000 and d from 0.01 to 5. To apply the second niching strategy of ENVirT, we chose N = 29.
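The evenness and f columns can be illustrated with a minimal Python sketch. It rests on assumptions that the table itself does not state: that the exponential model assigns rank i an abundance proportional to exp(-d·i), that the power-law model assigns it i^(-d), and that evenness is Shannon entropy divided by ln(M). The function names are hypothetical. Under these assumptions the exponential row (M = 300, d = 0.030) comes out close to the tabulated evenness of 0.790 and f of 2.956%, which suggests the assumed forms are at least consistent with the table.

```python
import math


def rank_abundances(model, d, M):
    """Relative abundances for ranks 1..M under an ASSUMED model form.

    "exp": abundance proportional to exp(-d * rank)
    "pl":  abundance proportional to rank ** (-d)
    """
    if model == "exp":
        weights = [math.exp(-d * i) for i in range(1, M + 1)]
    elif model == "pl":
        weights = [i ** (-d) for i in range(1, M + 1)]
    else:
        raise ValueError(f"model not covered by this sketch: {model}")
    total = sum(weights)
    return [w / total for w in weights]


def dominant_fraction(p):
    """f: relative abundance of the most abundant genotype."""
    return max(p)


def shannon_evenness(p):
    """Shannon entropy normalised by ln(M) (assumed evenness definition)."""
    entropy = -sum(x * math.log(x) for x in p if x > 0)
    return entropy / math.log(len(p))


if __name__ == "__main__":
    p = rank_abundances("exp", 0.030, 300)
    print(f"f = {100 * dominant_fraction(p):.3f}%")
    print(f"evenness = {shannon_evenness(p):.3f}")
```

The logarithmic and lognormal models are left out because their exact parameterisation cannot be inferred from the table alone.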
Comparison between PHACCS + GAAS/BLAST and ENVirT estimates of viral richness and average genome length on viral metagenomes derived from different environments
| Source | Sample name | ENVirT L (bp) | ENVirT M | ENVirT evenness | PHACCS L (bp) | PHACCS M | PHACCS evenness |
|---|---|---|---|---|---|---|---|
| French Lakes | Lake Bourget | 62279 | 42999 | 0.84862 | 13089 ⋆ | 33311 | 0.92228 |
| French Lakes | Lake Pavin | 81110 | 792 | 0.82202 | 12274 ⋆ | 2628 | 0.89747 |
| Feitsui Reservoir | V1 | 24112 | 587 | 0.84216 | 44297 ⋆ | 3059 | 0.72402 |
| Feitsui Reservoir | V2 | 16613 | 1288 | 0.88611 | 43926 ⋆ | 513 | 0.93042 |
| Feitsui Reservoir | V3 | 31019 | 617 | 0.93707 | 95269 ⋆ | 174 | 0.94079 |
| Feitsui Reservoir | V4 | 16535 | 1092 | 0.89225 | 62395 ⋆ | 399 | 0.91161 |
| Feitsui Reservoir | V5 | 15177 | 1121 | 0.89919 | 41377 ⋆ | 419 | 0.93946 |
| Feitsui Reservoir | V6 | 46677 | 1929 | 0.79735 | 125321 ⋆ | 221 | 0.90320 |
| Fermented food | Shrimp | 27337 | 4931 | 0.92204 | 39839 | 4606 | 0.90349 |
| Fermented food | Kimchi | 53837 | 1395 | 0.88842 | 48220 | 1415 | 0.89653 |
| Fermented food | Sauerkraut | 277163 | 719 | 0.80599 | 36494 | 2692 | 0.86619 |
| Perennial ponds of the Mauritanian Sahara | Ilij | 75242 | 1703 | 0.88137 | 71477 ⋆ | 1687 | 0.88550 |
| Perennial ponds of the Mauritanian Sahara | Molomhar | 394921 | 223 | 0.87082 | 60959 ⋆ | 1318 | 0.89228 |
| Perennial ponds of the Mauritanian Sahara | Hamdoun | 176346 | 515 | 0.66600 | 60479 ⋆ | 217 | 0.88719 |
| Perennial ponds of the Mauritanian Sahara | El Berbera | 81118 | 6199 | 0.69961 | 76501 ⋆ | 5696 | 0.71009 |
| Human gut | X-1 | 175863 | 559 | 0.83496 | 50000 | 815 | 0.92174 |
| Human gut | H1-1 | 497223 | 609 | 0.62918 | 50000 | 397 | 0.92259 |
| Human gut | H1-2 | 387877 | 212 | 0.73163 | 50000 | 353 | 0.92904 |
| Human gut | H1-7 | 282786 | 151 | 0.78132 | 50000 | 315 | 0.92531 |
| Human gut | H1-8 | 570706 | 121 | 0.68525 | 50000 | 239 | 0.94400 |
M = estimated richness, L = estimated average genome length (bp). § = Average genome length used in the original publication. ⋆= An estimate based on GAAS software ([33]). †= An estimate based on a BLAST search. ‡= Assumed value
Performance comparison between ENVirT, PHACCS and CatchAll on simulated contig spectra
| Input L0 (bp) | Input M0 | Input T | Input d | Evenness | f | ENVirT M | ENVirT T | ENVirT d | ENVirT S | PHACCS M | PHACCS T | PHACCS d | PHACCS S | CatchAll M |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12500 | 300 | exp | 0.030 | 0.790 | 2.956% | 300 | exp | 0.030 | 0.00×10^0 | 4096 | exp | 0.030 | 1.37×10^-3 | 2829.6 |
| 12500 | 1000 | log | 0.900 | 0.995 | 0.661% | 1000 | log | 0.900 | 0.00×10^0 | 1000 | log | 0.900 | 0.00×10^0 | 92628.3 |
| 12500 | 5000 | lgn | 2.500 | 0.655 | 11.849% | 5000 | lgn | 2.500 | 0.00×10^0 | 23563 | pl | 1.313 | 1.01×10^4 | 3246.1 |
| 12500 | 10000 | pl | 0.700 | 0.913 | 1.997% | 10000 | pl | 0.700 | 0.00×10^0 | 10000 | pl | 0.700 | 0.00×10^0 | 696.3 |
| 50000 | 300 | exp | 0.030 | 0.790 | 2.956% | 300 | exp | 0.030 | 0.00×10^0 | 10000 | exp | 0.030 | 4.31×10^-4 | 15712.6 |
| 50000 | 1000 | log | 0.900 | 0.995 | 0.661% | 1000 | log | 0.900 | 0.00×10^0 | 1000 | log | 0.900 | 0.00×10^0 | n/a |
| 50000 | 5000 | lgn | 2.500 | 0.655 | 11.849% | 5000 | lgn | 2.500 | 0.00×10^0 | 4996 | lgn | 2.500 | 1.78×10^-3 | 799.8 |
| 50000 | 10000 | pl | 0.700 | 0.913 | 1.997% | 10000 | pl | 0.700 | 0.00×10^0 | 10000 | pl | 0.700 | 0.00×10^0 | 413688.9 |
| 125000 | 300 | exp | 0.030 | 0.790 | 2.956% | 300 | exp | 0.030 | 0.00×10^0 | 10000 | exp | 0.060 | 1.87×10^-4 | 70340.9 |
| 125000 | 1000 | log | 0.900 | 0.995 | 0.661% | 1000 | log | 0.900 | 0.00×10^0 | 1000 | log | 0.900 | 0.00×10^0 | n/a |
| 125000 | 5000 | lgn | 2.500 | 0.655 | 11.849% | 5000 | lgn | 2.500 | 0.00×10^0 | 5000 | lgn | 2.500 | 0.00×10^0 | 2303.2 |
| 125000 | 10000 | pl | 0.700 | 0.913 | 1.997% | 10000 | pl | 0.700 | 0.00×10^0 | 10000 | pl | 0.700 | 0.00×10^0 | n/a |
| 300000 | 300 | exp | 0.030 | 0.790 | 2.956% | 300 | exp | 0.030 | 0.00×10^0 | 4096 | exp | 0.030 | 7.92×10^-5 | 160243.9 |
| 300000 | 1000 | log | 0.900 | 0.995 | 0.661% | 1000 | log | 0.900 | 0.00×10^0 | 1000 | log | 0.900 | 0.00×10^0 | n/a |
| 300000 | 5000 | lgn | 2.500 | 0.655 | 11.849% | 5000 | lgn | 2.500 | 0.00×10^0 | 5000 | lgn | 2.500 | 0.00×10^0 | 146552.7 |
| 300000 | 10000 | pl | 0.700 | 0.913 | 1.997% | 8547 | pl | 0.689 | 3.00×10^-3 | 10000 | pl | 0.700 | 0.00×10^0 | n/a |
Contig spectra were generated with parameters R = 10000, r = 100 bp and o = 35 bp. Both ENVirT and PHACCS were provided with the true average genome length (L0). pl = power-law distribution, exp = exponential distribution, log = logarithmic distribution and lgn = lognormal distribution. S = the value of the cost function corresponding to the estimated values of M, T and d for each method. For each spectrum, the CatchAll estimate with the minimum error relative to M0 is reported; it is either the best discounted parametric model produced by CatchAll or the Chao1 non-parametric estimate. n/a denotes samples for which CatchAll failed to produce an output.
Fig. 1 Estimated average genome length versus true average genome length for ENVirT. This analysis uses only the information contained in the contig spectra for a given virome, which shows that ENVirT does not require underlying sequence data or other databases to estimate an average genome length. Here, ENVirT estimates the true average genome length with an average error of 9.13%.
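An average error of this kind can be computed as a mean absolute relative error. The sketch below is only illustrative: the caption does not say exactly how the 9.13% figure was averaged, and the function name is hypothetical. The example pairs are not the Fig. 1 data; they are four (true L, ENVirT L) pairs taken from the simulated contig spectra table where the ENVirT estimate differs from the input.

```python
def mean_relative_error(pairs):
    """Mean of |estimate - truth| / truth over (truth, estimate) pairs.

    Assumed definition of "average error"; the source does not spell it out.
    """
    if not pairs:
        raise ValueError("need at least one (truth, estimate) pair")
    return sum(abs(est - truth) / truth for truth, est in pairs) / len(pairs)


if __name__ == "__main__":
    # (true L, ENVirT-estimated L) in bp, from rows of the simulated-spectra table.
    example_pairs = [
        (12500, 14972),
        (50000, 52787),
        (125000, 116341),
        (300000, 217303),
    ]
    print(f"mean relative error = {100 * mean_relative_error(example_pairs):.2f}%")
```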
Fig. 2 (a) CV(RMSE) of estimated M when M0 = 300, (b) CV(RMSE) of estimated M when M0 = 10000, (c) CV(RMSE) of estimated L when M0 = 300 and (d) CV(RMSE) of estimated L when M0 = 10000, for spectra in Simulation Scenario 2 categorized under different values of v (v ∈ {0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1}). CV(RMSE) = Coefficient of Variation of the Root Mean Squared Error, M0 = simulated true richness, M = estimated richness and L = estimated average genome length. All 140 spectra used here were derived from populations simulated using a genome length distribution of and L0 = 50 kbp. ENVirT-FL = ENVirT algorithm given a fixed value for L. Only the power-law distribution was considered in all three methods. Values summarized in the figure are given in Table S1 of Additional file 1.
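As a rough illustration of the CV(RMSE) metric named in the caption, the sketch below computes the root mean squared error of a set of estimates and divides it by the true value. Normalising by the simulated true value (M0 or L0) rather than by the mean of the estimates is an assumption, since the caption does not specify the denominator; the function name is hypothetical.

```python
import math


def cv_rmse(estimates, true_value):
    """RMSE of the estimates around true_value, divided by true_value.

    The choice of true_value as the normaliser is an assumption.
    """
    if not estimates:
        raise ValueError("need at least one estimate")
    mse = sum((e - true_value) ** 2 for e in estimates) / len(estimates)
    return math.sqrt(mse) / true_value


if __name__ == "__main__":
    # Two hypothetical richness estimates for a spectrum with M0 = 100.
    print(cv_rmse([90.0, 110.0], 100.0))
```

A value of 0 means every estimate equals the true value; larger values indicate estimates that scatter further from it relative to its magnitude.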