Jie Ren (1), Kai Song (2), Chao Deng (1), Nathan A Ahlgren (3), Jed A Fuhrman (4), Yi Li (5), Xiaohui Xie (5), Ryan Poplin (6), Fengzhu Sun (1)

1. Quantitative and Computational Biology Program, University of Southern California, Los Angeles, CA 90089, USA.
2. School of Mathematics and Statistics, Qingdao University, Qingdao 266071, China.
3. Department of Biology, Clark University, Worcester, MA 01610, USA.
4. Department of Biological Sciences, University of Southern California, Los Angeles, CA 90089, USA.
5. Department of Computer Science, University of California, Irvine, CA 92697, USA.
6. Google Inc., Mountain View, CA 94043, USA.
Abstract
BACKGROUND: The recent development of metagenomic sequencing makes it possible to sequence microbial genomes, including viral genomes, at massive scale without the need for laboratory culture. Existing reference-based and gene homology-based methods are inefficient at identifying unknown viruses or short viral sequences in metagenomic data.

METHODS: Here we developed DeepVirFinder, a reference-free and alignment-free machine learning method that uses deep learning to identify viral sequences in metagenomic data.

RESULTS: Trained on viral RefSeq sequences discovered before May 2015 and evaluated on those discovered after that date, DeepVirFinder outperformed the state-of-the-art method VirFinder at all contig lengths, achieving AUROC 0.93, 0.95, 0.97, and 0.98 for 300, 500, 1000, and 3000 bp sequences, respectively. Enlarging the training data with millions of additional purified viral sequences from metavirome samples further improved the accuracy of identifying under-represented virus groups. Applying DeepVirFinder to real human gut metagenomic samples, we identified 51,138 viral sequences belonging to 175 bins in patients with colorectal carcinoma (CRC). Ten bins were found to be associated with cancer status, suggesting that viruses may play important roles in CRC.

CONCLUSIONS: Powered by deep learning and high-throughput metagenomic sequencing data, DeepVirFinder significantly improved the accuracy of viral identification and will assist the study of viruses in the era of metagenomics.
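The AUROC values reported in the RESULTS summarize how well a classifier's scores rank viral sequences above non-viral ones. As an illustration only, the following minimal sketch computes AUROC from labels and scores using the rank-sum (Mann-Whitney U) identity. The function name `auroc` and the toy labels and scores are assumptions made for this example; they are not taken from DeepVirFinder or from the paper's data.

```python
def auroc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    labels: sequence of 1 (viral) or 0 (non-viral); hypothetical toy input.
    scores: classifier scores, where higher means more virus-like.
    """
    pairs = sorted(zip(scores, labels))  # ascending by score
    n = len(pairs)

    # Assign 1-based ranks, giving tied scores their average rank.
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1

    n_pos = sum(1 for _, label in pairs if label == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")

    # U statistic for the positive class, normalized to [0, 1].
    rank_sum_pos = sum(r for r, (_, label) in zip(ranks, pairs) if label == 1)
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


# Toy example (made-up numbers): two viral and two non-viral contigs.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.3, 0.1]
print(auroc(toy_labels, toy_scores))  # prints 0.75
```

An AUROC of 1.0 means every viral sequence scores higher than every non-viral one, while 0.5 corresponds to a ranking no better than chance.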
Keywords:
deep learning; machine learning; metagenome; virus identification