Yang Young Lu (1), Ting Chen (1, 2), Jed A Fuhrman (3), Fengzhu Sun (1, 4).
1. Molecular and Computational Biology Program, Department of Biological Sciences, University of Southern California, Los Angeles, CA, USA.
2. Center for Synthetic and Systems Biology, TNLIST, Beijing, China.
3. Department of Biological Sciences and Wrigley Institute for Environmental Studies, University of Southern California, Los Angeles, CA, USA.
4. Center for Computational Systems Biology, Fudan University, Shanghai, China.
Abstract
Motivation: The advent of next-generation sequencing technologies enables researchers to sequence complex microbial communities directly from the environment. Because assembly typically produces only genome fragments, known as contigs, rather than entire genomes, it is crucial to group contigs into operational taxonomic units (OTUs) for taxonomic profiling and downstream functional analysis. OTU clustering is also referred to as binning. We present COCACOLA, a general framework that automatically bins contigs into OTUs based on sequence composition and coverage across multiple samples.
Results: The effectiveness of COCACOLA is demonstrated on both simulated and real datasets in comparison with state-of-the-art binning approaches such as CONCOCT, GroopM, MaxBin and MetaBAT. The superior performance of COCACOLA rests on two aspects. First, it uses L1 distance instead of Euclidean distance for better taxonomic identification during initialization. Second, and more importantly, it takes advantage of both hard clustering and soft clustering through sparsity regularization. In addition, the COCACOLA framework can seamlessly incorporate customized knowledge to improve binning accuracy. In our study, we investigated two types of additional knowledge, co-alignment to reference genomes and linkage of contigs provided by paired-end reads, as well as the ensemble of both. We find that both co-alignment and linkage information further improve binning in the majority of cases. COCACOLA is scalable and faster than CONCOCT, GroopM, MaxBin and MetaBAT.
Availability and implementation: The software is available at https://github.com/younglululu/COCACOLA .
Contact: fsun@usc.edu.
Supplementary information: Supplementary data are available at Bioinformatics online.
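To make the contrast between L1 and Euclidean distance concrete, the following is a minimal sketch that compares the two on normalized k-mer composition profiles of two toy contigs. It is an illustration only, not COCACOLA's implementation: the function names, the choice of k = 4, and the example sequences are all assumptions introduced here, and coverage across samples is left out entirely.

```python
from itertools import product
from math import sqrt


def kmer_profile(seq, k=4):
    """Return normalized k-mer frequencies over the ACGT alphabet.

    Hypothetical helper: k=4 is an assumed setting for illustration.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = {km: 0 for km in kmers}
    for i in range(len(seq) - k + 1):
        km = seq[i:i + k]
        if km in counts:  # skip windows containing ambiguous bases
            counts[km] += 1
    total = sum(counts.values()) or 1  # avoid division by zero
    return [counts[km] / total for km in kmers]


def l1_distance(u, v):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def euclidean_distance(u, v):
    """Square root of the summed squared coordinate differences."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy contigs (made up): one AT-rich, one GC-rich.
contig_a = "ACGTACGTACGTAAACATTA"
contig_b = "GGGCGCGCGGCCGCGCGGCC"

pa, pb = kmer_profile(contig_a), kmer_profile(contig_b)
d_l1 = l1_distance(pa, pb)
d_l2 = euclidean_distance(pa, pb)
print(f"L1 = {d_l1:.3f}, Euclidean = {d_l2:.3f}")
```

For any pair of profiles the L1 distance is at least as large as the Euclidean distance, and for normalized frequency vectors it is bounded above by 2. Squaring in the Euclidean case shrinks the contribution of many small per-k-mer differences relative to a few large ones, which is one plausible reason the two metrics can rank contig pairs differently.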