Jiayu Shang (1), Xubo Tang (1), Ruocheng Guo (2), Yanni Sun (1)
1. Department of Electrical Engineering, City University of Hong Kong, Hong Kong (SAR), China.
2. School of Data Science, City University of Hong Kong, Hong Kong (SAR), China.
Abstract
MOTIVATION: Bacteriophages are viruses that infect bacteria. As key players in microbial communities, they can regulate the composition and function of the microbiome by infecting their bacterial hosts and mediating gene transfer. Recently, metagenomic sequencing, which can sequence all genetic material from a microbiome, has become a popular means of discovering new phages. However, accurate and comprehensive detection of phages from metagenomic data remains difficult. High diversity and abundance, together with limited reference genomes, pose major challenges for recruiting phage fragments from metagenomic data. Existing alignment-based or learning-based models have either low recall or low precision on metagenomic data.

RESULTS: In this work, we adopt a state-of-the-art language model, the Transformer, to compute contextual embeddings for phage contigs. By constructing a protein-cluster vocabulary, we can feed both the protein composition and the proteins' positions from each contig into the Transformer. The Transformer learns protein organization and associations through the self-attention mechanism and predicts the label for test contigs. We rigorously tested the resulting tool, PhaMer, on multiple datasets of increasing difficulty, including high-quality RefSeq genomes, short contigs, simulated metagenomic data, mock metagenomic data and the public IMG/VR dataset. All the experimental results show that PhaMer outperforms the state-of-the-art tools. In the real metagenomic data experiment, PhaMer improves the F1-score of phage detection by 27%.
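To make the protein-cluster vocabulary idea concrete, the following is a minimal sketch of how a contig might be turned into a token sequence that carries both protein composition and protein order. It is not PhaMer's implementation: the cluster names, the reserved padding and unknown tokens, the `max_len` parameter and the helper names `build_vocab` and `encode_contig` are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

PAD_ID = 0  # assumed padding token (not specified in the abstract)
UNK_ID = 1  # assumed token for proteins that match no cluster


def build_vocab(cluster_names: List[str]) -> Dict[str, int]:
    """Assign one integer token ID per protein cluster; IDs 0 and 1 are reserved."""
    return {name: idx + 2 for idx, name in enumerate(sorted(set(cluster_names)))}


def encode_contig(
    protein_hits: List[Tuple[int, Optional[str]]],
    vocab: Dict[str, int],
    max_len: int,
) -> Tuple[List[int], List[int], List[int]]:
    """Encode a contig as (tokens, positions, mask).

    protein_hits holds one (start_coordinate, cluster_name) pair per predicted
    protein; cluster_name is None when the protein matches no cluster.
    """
    # Order proteins by genomic start so the token order reflects their positions.
    ordered = sorted(protein_hits, key=lambda hit: hit[0])
    tokens = [
        vocab.get(name, UNK_ID) if name is not None else UNK_ID
        for _, name in ordered
    ][:max_len]
    mask = [1] * len(tokens)
    # Pad to a fixed length so contigs can be batched together.
    padding = max_len - len(tokens)
    tokens += [PAD_ID] * padding
    mask += [0] * padding
    positions = list(range(max_len))
    return tokens, positions, mask


# Hypothetical clusters and hits, for illustration only.
vocab = build_vocab(["PC_capsid", "PC_terminase", "PC_tail"])
hits = [(900, "PC_tail"), (100, "PC_terminase"), (500, None), (300, "PC_capsid")]
tokens, positions, mask = encode_contig(hits, vocab, max_len=6)
```

In a setup like this, `tokens` and `positions` would typically be mapped to token and positional embeddings and summed before entering the self-attention layers, while `mask` tells the model to ignore padding.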