Ziye Wang1,2,3, Zhengyang Wang2, Yang Young Lu4, Fengzhu Sun1,3,4, Shanfeng Zhu2,3,5. 1. Centre for Computational Systems Biology, School of Mathematical Sciences, Shanghai, China. 2. School of Computer Science and Shanghai Key Lab of Intelligent Information Processing, Shanghai, China. 3. Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China. 4. Quantitative and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA, USA. 5. Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, China.
Abstract
MOTIVATION: Metagenomic contig binning is an important computational problem in metagenomic research, which aims to cluster contigs from the same genome into the same group. Unlike classical clustering problems, contig binning can utilize known relationships among some of the contigs or the taxonomic identity of some contigs. However, current state-of-the-art contig binning methods do not make full use of biological information beyond the coverage and sequence composition of the contigs. RESULTS: We developed a novel contig binning method, Semi-supervised Spectral Normalized Cut for Binning (SolidBin), based on semi-supervised spectral clustering. Using sequence feature similarity and/or additional biological information, such as reliable taxonomy assignments of some contigs, SolidBin constructs two types of prior information: must-link and cannot-link constraints. A must-link constraint means that a pair of contigs should be clustered into the same group, while a cannot-link constraint means that a pair of contigs should be clustered into different groups. These constraints are then integrated into a classical spectral clustering approach, normalized cut, for improved contig binning. The performance of SolidBin is compared with five state-of-the-art genome binners (CONCOCT, COCACOLA, MaxBin, MetaBAT and BMC3C) on five next-generation sequencing benchmark datasets, including simulated multi- and single-sample datasets and real multi-sample datasets. The experimental results show that SolidBin achieves the best performance in terms of F-score, Adjusted Rand Index and Normalized Mutual Information, especially on the real datasets and the single-sample dataset. AVAILABILITY AND IMPLEMENTATION: https://github.com/sufforest/SolidBin. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
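The idea of must-link and cannot-link constraints can be illustrated with a minimal sketch. It is not the SolidBin implementation: the function names `build_constraints` and `apply_constraints`, the contig identifiers, and the rule of overwriting similarities with 1.0 and 0.0 are all assumptions made for illustration. It also assumes that two contigs sharing a reliable taxonomy label belong to the same genome and that two contigs with different labels do not, which only holds when the labels are fine-grained enough.

```python
from itertools import combinations


def build_constraints(taxonomy):
    """Derive pairwise constraints from partial taxonomy labels.

    `taxonomy` maps contig id -> label, and lists only the contigs that have
    a reliable assignment. Returns (must_link, cannot_link), each a set of
    (contig_a, contig_b) pairs with contig_a < contig_b.
    """
    must_link, cannot_link = set(), set()
    for a, b in combinations(sorted(taxonomy), 2):
        if taxonomy[a] == taxonomy[b]:
            must_link.add((a, b))      # same label: should share a bin
        else:
            cannot_link.add((a, b))    # different labels: should be separated
    return must_link, cannot_link


def apply_constraints(similarity, index, must_link, cannot_link):
    """Return a copy of a symmetric similarity matrix with constraints imposed.

    Hypothetical integration rule: must-link pairs get similarity 1.0 and
    cannot-link pairs get 0.0. `index` maps contig id -> row/column position.
    """
    adjusted = [row[:] for row in similarity]  # leave the input untouched
    for pairs, value in ((must_link, 1.0), (cannot_link, 0.0)):
        for a, b in pairs:
            i, j = index[a], index[b]
            adjusted[i][j] = adjusted[j][i] = value
    return adjusted


if __name__ == "__main__":
    # Illustrative labels only; contig_3 has no reliable assignment.
    taxonomy = {"contig_1": "taxon_A", "contig_2": "taxon_A", "contig_4": "taxon_B"}
    must_link, cannot_link = build_constraints(taxonomy)
    print(sorted(must_link))
    print(sorted(cannot_link))
```

The adjusted matrix could then be passed to any spectral clustering routine that accepts a precomputed affinity matrix. Contigs without a reliable label simply contribute no constraints and keep their original similarities.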