Congyu Lu1, Zheng Zhang1, Zena Cai1, Zhaozhong Zhu1, Ye Qiu1, Aiping Wu2,3, Taijiao Jiang2,3, Heping Zheng1, Yousong Peng4. 1. Bioinformatics Center, College of Biology, Hunan Provincial Key Laboratory of Medical Virology, Hunan University, Changsha, China. 2. Center for Systems Medicine, Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, 100005, China. 3. Suzhou Institute of Systems Medicine, Suzhou, 215123, Jiangsu, China. 4. Bioinformatics Center, College of Biology, Hunan Provincial Key Laboratory of Medical Virology, Hunan University, Changsha, China. pys2013@hnu.edu.cn.
Abstract
BACKGROUND: Viruses are ubiquitous biological entities, estimated to be the largest reservoir of unexplored genetic diversity on Earth. Full functional characterization and annotation of newly discovered viruses requires tools for assigning taxonomy, determining host range, and inferring the biological properties of the virus. Here we focus on prokaryotic viruses, which include phages and archaeal viruses. For these viruses, identifying the host is an essential step in characterization, as the virus relies on the host for survival. Currently, the viral host is determined either by culturing the virus, which is low-throughput, time-consuming, and expensive, or by computational prediction, which needs improvement in both accuracy and usability. Here we develop a Gaussian model that predicts hosts for prokaryotic viruses with better performance than previous computational methods.
RESULTS: We present Prokaryotic virus Host Predictor (PHP), a software tool that uses a Gaussian model to predict hosts for prokaryotic viruses, with the differences in k-mer frequencies between viral and host genomic sequences as features. PHP achieved a host prediction accuracy of 34% (genus level) on the VirHostMatcher benchmark dataset and 35% (genus level) on a new dataset containing 671 viruses and 60,105 prokaryotic genomes. This accuracy exceeded that of two alignment-free methods (VirHostMatcher and WIsH, 28-34%, genus level). PHP also substantially outperformed these two alignment-free methods (24-38% vs 18-20%, genus level) when predicting hosts for prokaryotic viruses whose hosts cannot be predicted by the BLAST-based or the CRISPR-spacer-based methods alone. Requiring a minimal score for making predictions (thresholding) and taking the consensus of the top 30 predictions further improved the host prediction accuracy of PHP.
CONCLUSIONS: The Prokaryotic virus Host Predictor software tool provides an intuitive and user-friendly API for the Gaussian model described herein. This work will facilitate the rapid identification of hosts for newly identified prokaryotic viruses in metagenomic studies.
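To make the feature construction and post-processing described in the abstract concrete, the following is a minimal Python sketch, not the PHP implementation. It computes normalized k-mer frequencies for a viral and a host sequence, takes their element-wise difference as a feature vector, and then applies a score threshold and a top-N consensus to a list of already scored candidate hosts. The function names, the choice of k = 4, and the interpretation of "consensus" as a majority vote over genus labels are assumptions made for illustration. The Gaussian model that would turn the difference vector into a score is not shown.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=4):
    """Return normalized k-mer frequencies in a fixed ACGT-lexicographic order.

    k=4 is an illustrative assumption; the abstract does not state k.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Only k-mers made purely of A/C/G/T contribute to the total,
    # so windows containing ambiguous bases such as N are ignored.
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def frequency_difference(virus_seq, host_seq, k=4):
    """Element-wise difference of virus and host k-mer frequencies."""
    v = kmer_frequencies(virus_seq, k)
    h = kmer_frequencies(host_seq, k)
    return [a - b for a, b in zip(v, h)]


def consensus_host(scored_hosts, top_n=30, min_score=None):
    """Pick a host genus from (genus, score) pairs.

    Hypothetical post-processing: returns None if the best score is below
    min_score (thresholding); otherwise returns the most frequent genus
    among the top_n highest-scoring candidates (majority-vote consensus).
    """
    ranked = sorted(scored_hosts, key=lambda pair: pair[1], reverse=True)
    if not ranked:
        return None
    if min_score is not None and ranked[0][1] < min_score:
        return None
    top = ranked[:top_n]
    return Counter(genus for genus, _ in top).most_common(1)[0][0]
```

In this sketch, `frequency_difference(virus, host)` would be evaluated once per candidate host genome, each resulting vector would be scored by a fitted model, and the `(genus, score)` pairs would then be passed to `consensus_host`. Declining to predict when the best score falls below `min_score` trades coverage for accuracy, which is the effect of thresholding that the abstract reports.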