An efficient classification algorithm for NGS data based on text similarity.

Xiangyu Liao, Xingyu Liao, Wufei Zhu, Lu Fang, Xing Chen.

Abstract

With the advancement of high-throughput sequencing technologies, the amount of available sequencing data is growing at a pace that has begun to seriously challenge the data processing and storage capacities of modern computer systems. Removing redundancy from such data by clustering can be crucial for reducing memory use, disk space and running time. It can also reduce dataset noise in some analysis applications. In this study, we propose a high-performance short sequence classification algorithm (HSC) for next generation sequencing (NGS) data based on an efficient hash function and text similarity. First, HSC converts all reads into k-mers and forms a unique k-mer set by merging duplicated and reverse complementary elements. Second, all unique k-mers are stored in a hash table, where the k-mer string is stored in the key field and the IDs of the reads containing the k-mer are stored in the value field. Third, each hash unit is transformed into a short text consisting of reads. Fourth, texts that satisfy the similarity threshold are combined into a long text; the merge operation is executed iteratively until no text satisfies the merge condition. Finally, each long text is transformed into a cluster consisting of reads. We tested HSC using five real datasets. The experimental results showed that HSC clusters 100 million short reads within 2 hours and has excellent performance in reducing memory consumption. HSC is much faster than existing tools and can easily handle tens of millions of sequences. In addition, when HSC is used as a preprocessing tool to produce assembly data, the memory and time consumption of the assembler is greatly reduced. HSC can also help the assembler achieve better assemblies in terms of N50, NA50 and genome fraction.
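
The five steps in the abstract can be illustrated with a small Python sketch. This is not the authors' implementation. The abstract does not say which similarity measure, threshold or k-mer length HSC uses, so the Jaccard index over read-ID sets, the default `threshold=0.5`, the default `k=4` and all function names (`canonical`, `build_kmer_table`, `jaccard`, `cluster_reads`) are assumptions made for illustration. The pairwise merge loop is deliberately naive and would not scale to the dataset sizes reported in the abstract.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the smaller of a k-mer and its reverse complement,
    so that both strands map to the same hash key."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def build_kmer_table(reads, k):
    """Steps 1-2: key = unique canonical k-mer, value = IDs of reads containing it."""
    table = defaultdict(set)
    for read_id, seq in enumerate(reads):
        for i in range(len(seq) - k + 1):
            table[canonical(seq[i:i + k])].add(read_id)
    return table


def jaccard(a, b):
    """Similarity of two texts, each represented as a set of read IDs (assumed measure)."""
    return len(a & b) / len(a | b)


def cluster_reads(reads, k=4, threshold=0.5):
    # Step 3: each hash unit becomes a "short text" made of read IDs.
    texts = [set(ids) for ids in build_kmer_table(reads, k).values()]
    # Step 4: merge similar texts until no pair satisfies the threshold.
    merged = True
    while merged:
        merged = False
        for i in range(len(texts)):
            for j in range(i + 1, len(texts)):
                if jaccard(texts[i], texts[j]) >= threshold:
                    other = texts.pop(j)
                    texts[i] |= other
                    merged = True
                    break
            if merged:
                break
    # Step 5: each remaining long text is reported as one cluster of read IDs.
    return sorted(sorted(text) for text in texts)


reads = ["ACGTACGT", "CGTACGTA", "TTTTGGGG", "TTTGGGGC"]
clusters = cluster_reads(reads, k=4, threshold=0.5)
print(clusters)
```

With these four toy reads, reads 0 and 1 share all of their canonical 4-mers and reads 2 and 3 share most of theirs, so the sketch groups them into two clusters. In this simplified version a read may end up in more than one cluster when the threshold is strict; how HSC resolves such cases is not stated in the abstract.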

Keywords:  NGS sequences data; clustering; text similarity

Year:  2018        PMID: 30221607      PMCID: PMC6865153          DOI: 10.1017/S0016672318000058

Source DB:  PubMed          Journal:  Genet Res (Camb)        ISSN: 0016-6723            Impact factor:   1.588


