Metagenome sequence clustering with hash-based canopies.

Mohammad Arifur Rahman¹, Nathan LaPierre², Huzefa Rangwala¹, Daniel Barbara¹.

Abstract

Metagenomics is the collective sequencing of co-existing microbial communities, which are ubiquitous across clinical and ecological environments. Because community sequencing yields a large volume of short, randomly sampled sequences (reads), analyzing the diversity, abundance, and functions of the organisms within these communities is challenging. We present a fast and scalable clustering algorithm for analyzing large-scale metagenome sequence data. Our approach achieves efficiency by using hashing to partition the large number of sequence reads into groups called canopies. These canopies are then refined using state-of-the-art sequence clustering algorithms. This canopy-clustering (CC) algorithm can be used as a pre-processing phase for computationally expensive clustering algorithms. We compare three hashing schemes for canopy construction in combination with five popular, state-of-the-art sequence clustering methods. We evaluate our clustering algorithm on synthetic and real-world 16S and whole-metagenome benchmarks. We demonstrate that the proposed approach determines meaningful Operational Taxonomic Units (OTUs) and observe a significant run-time speedup compared with different clustering algorithms. Our source code is publicly available on GitHub.
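
The two-stage idea described in the abstract (hash reads into canopies, then refine each canopy with a clustering method) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the min-hash over k-mers stands in for the unspecified hashing schemes, the greedy Jaccard clustering stands in for the state-of-the-art sequence clustering methods, and the function names and the parameters `k`, `num_hashes`, and `threshold` are hypothetical.

```python
import hashlib
from collections import defaultdict


def kmers(read, k):
    """Return the set of length-k substrings of a read (the whole read if it is shorter than k)."""
    return {read[i:i + k] for i in range(max(len(read) - k + 1, 1))}


def stable_hash(kmer, seed):
    """Deterministic 64-bit hash of a k-mer; the seed selects one of several hash functions."""
    digest = hashlib.blake2b(
        kmer.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def canopy_key(read, k=4, num_hashes=2):
    """Assumed hashing scheme: a min-hash signature over the read's k-mers."""
    ks = kmers(read, k)
    return tuple(min(stable_hash(km, seed) for km in ks) for seed in range(num_hashes))


def build_canopies(reads, k=4, num_hashes=2):
    """Stage 1: group reads that share the same hash signature into one canopy."""
    canopies = defaultdict(list)
    for read in reads:
        canopies[canopy_key(read, k, num_hashes)].append(read)
    return list(canopies.values())


def jaccard(a, b):
    """Jaccard similarity of two non-empty k-mer sets."""
    return len(a & b) / len(a | b)


def greedy_cluster(reads, k=4, threshold=0.5):
    """Stage 2 stand-in: put each read in the first cluster whose representative is similar enough."""
    clusters = []  # list of (representative k-mer set, member reads)
    for read in reads:
        ks = kmers(read, k)
        for rep, members in clusters:
            if jaccard(ks, rep) >= threshold:
                members.append(read)
                break
        else:
            clusters.append((ks, [read]))
    return [members for _, members in clusters]


def canopy_clustering(reads, k=4, num_hashes=2, threshold=0.5):
    """Run the expensive refinement only within each canopy, never across canopies."""
    result = []
    for canopy in build_canopies(reads, k, num_hashes):
        result.extend(greedy_cluster(canopy, k, threshold))
    return result
```

Because the refinement step only compares reads inside the same canopy, the number of pairwise comparisons drops sharply when canopies are small, which is the source of the speedup the abstract reports. The trade-off in this sketch is that similar reads hashed into different canopies are never merged.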

Entities: Species

Keywords: 16S; Clustering; biodiversity; canopy; metagenome

Year: 2017 PMID: 29113561 DOI: 10.1142/S0219720017400066

Source DB: PubMed Journal: J Bioinform Comput Biol ISSN: 0219-7200 Impact factor: 1.122

Cited

2 in total

1. An efficient classification algorithm for NGS data based on text similarity.

Authors: Xiangyu Liao; Xingyu Liao; Wufei Zhu; Lu Fang; Xing Chen
Journal: Genet Res (Camb) Date: 2018-09-17 Impact factor: 1.588

2. IDMIL: an alignment-free Interpretable Deep Multiple Instance Learning (MIL) for predicting disease from whole-metagenomic data.

Authors: Mohammad Arifur Rahman; Huzefa Rangwala
Journal: Bioinformatics Date: 2020-07-01 Impact factor: 6.937
