Deyvid Amgarten, Lucas P P Braga, Aline M da Silva, João C Setubal.
Abstract
Here we present MARVEL, a tool for prediction of double-stranded DNA bacteriophage sequences in metagenomic bins. MARVEL uses a random forest machine learning approach. We trained the program on a dataset with 1,247 phage and 1,029 bacterial genomes, and tested it on a dataset with 335 bacterial and 177 phage genomes. We show that three simple genomic features extracted from contig sequences were sufficient to achieve good performance in separating bacterial from phage sequences: gene density, strand shifts, and fraction of significant hits to a viral protein database. We compared the performance of MARVEL to that of VirSorter and VirFinder, two popular programs for predicting viral sequences. Our results show that all three programs have comparable specificity, but MARVEL achieves much better performance on the recall (sensitivity) measure. This means that MARVEL should be able to identify many more phage sequences in metagenomic bins than has previously been possible. In a simple test with real data, containing mostly bacterial sequences, MARVEL classified 58 out of 209 bins as phage genomes; other evidence suggests that 57 of these 58 bins are novel phage sequences. MARVEL is freely available at https://github.com/LaboratorioBioinformatica/MARVEL.
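The three features named in the abstract can be illustrated with a minimal Python sketch. This is not MARVEL's own code: the `Gene` record and the `contig_features` function are hypothetical names, the viral-database hit flag is assumed to have been computed beforehand by a separate search step, and a strand shift is assumed to mean a change of strand between two consecutive genes in genomic order, normalized by the total number of genes.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Gene:
    """Hypothetical record for one predicted gene, listed in genomic order."""
    strand: str           # "+" or "-"
    has_viral_hit: bool   # assumed precomputed: significant hit to a viral protein database


def contig_features(genes: Sequence[Gene], contig_length_bp: int) -> Tuple[float, float, float]:
    """Return (gene density per kbp, strand shifts per gene, fraction of genes with a viral hit)."""
    if contig_length_bp <= 0:
        raise ValueError("contig_length_bp must be positive")
    n_genes = len(genes)
    if n_genes == 0:
        return (0.0, 0.0, 0.0)

    # Gene density: number of genes per 1,000 bp of sequence.
    gene_density = n_genes / (contig_length_bp / 1000)

    # Strand shifts: count neighbouring gene pairs that lie on different strands
    # (assumed definition), then normalize by the total number of genes.
    shifts = sum(1 for a, b in zip(genes, genes[1:]) if a.strand != b.strand)
    strand_shift_rate = shifts / n_genes

    # Fraction of genes with a significant hit to the viral protein database.
    hit_fraction = sum(1 for g in genes if g.has_viral_hit) / n_genes

    return (gene_density, strand_shift_rate, hit_fraction)


example = [Gene("+", True), Gene("+", True), Gene("-", False), Gene("-", True)]
features = contig_features(example, contig_length_bp=4000)
```

A feature vector of this shape, one per bin, is the kind of input a random forest classifier would be trained on; the training step itself is not sketched here.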
Keywords: machine learning; microbiome; phage; random forest; virus
Year: 2018 PMID: 30131825 PMCID: PMC6090037 DOI: 10.3389/fgene.2018.00304
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
Mean values (and respective standard deviations) for three features extracted from the training dataset of dsDNA phage and bacterial genomes.
| Genome type | Gene density (genes per kbp) | Strand shifts per total number of genes | Fraction of significant pVOGs hits |
|---|---|---|---|
| Phage | 1.44 (±0.27) | 0.07 (±0.05) | 0.68 (±0.2) |
| Bacteria | 0.93 (±0.13) | 0.24 (±0.08) | 0.1 (±0.04) |
Running time for two different sets of bins.
| Program | Wall time (100 bins of ∼40 kbp) | CPU usage (100 bins of ∼40 kbp) | Wall time (100 bins of ∼160 kbp) | CPU usage (100 bins of ∼160 kbp) |
|---|---|---|---|---|
| MARVEL | 11 m 33 s | 17 m 54 s | 36 m 33 s | 70 m 45 s |
| VirSorter | 10 m 20 s | 27 m 21 s | 39 m 18 s | 140 m 8 s |
| VirFinder | 40 s | 40 s | 42 s | 42 s |
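The running times in the table above are easier to compare once converted to seconds. The following Python sketch does that conversion and computes a CPU-to-wall ratio for the ∼160 kbp bins; the helper names `to_seconds` and `cpu_to_wall_ratio` are illustrative only, and treating the ratio as a rough indicator of how much CPU work is done per unit of elapsed time is an assumption, not something stated in the source.

```python
import re


def to_seconds(text: str) -> int:
    """Convert a duration written like '11 m 33 s' or '40 s' to whole seconds."""
    match = re.fullmatch(r"\s*(?:(\d+)\s*m)?\s*(?:(\d+)\s*s)?\s*", text)
    if match is None or not any(match.groups()):
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes = int(match.group(1) or 0)
    seconds = int(match.group(2) or 0)
    return minutes * 60 + seconds


def cpu_to_wall_ratio(wall: str, cpu: str) -> float:
    """Ratio of CPU usage to wall time, both given as duration strings."""
    return to_seconds(cpu) / to_seconds(wall)


# (wall time, CPU usage) for 100 bins of ~160 kbp, copied from the table.
TIMES_160_KBP = {
    "MARVEL": ("36 m 33 s", "70 m 45 s"),
    "VirSorter": ("39 m 18 s", "140 m 8 s"),
    "VirFinder": ("42 s", "42 s"),
}

ratios = {name: cpu_to_wall_ratio(wall, cpu) for name, (wall, cpu) in TIMES_160_KBP.items()}

for name, ratio in ratios.items():
    print(f"{name}: CPU/wall = {ratio:.2f}")
```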