MeShClust: an intelligent tool for clustering DNA sequences.

Benjamin T James, Brian B Luczak, Hani Z Girgis.

Abstract

Sequence clustering is a fundamental step in analyzing DNA sequences. Widely used software tools for sequence clustering rely on greedy approaches that are not guaranteed to produce the best results. These tools are sensitive to a single parameter: the similarity threshold that determines which sequences belong to the same cluster. Often, a biologist does not know the exact sequence similarity. If the provided parameter is inaccurate, the clusters produced by these tools are therefore unlikely to match the real clusters in the data. To overcome this limitation, we adapted the mean shift algorithm, an unsupervised machine-learning algorithm that has been used successfully thousands of times in fields such as image processing and computer vision. Unlike the greedy approaches, the theory behind the mean shift algorithm guarantees convergence to the modes, i.e. the cluster centers. Here we describe the first application of the mean shift algorithm to clustering DNA sequences. MeShClust is one of the few applications of the mean shift algorithm in bioinformatics. Further, we applied supervised machine learning to predict, from alignment-free measures, the identity score that a global alignment would produce. We demonstrate MeShClust's ability to cluster DNA sequences with high accuracy even when the sequence similarity parameter provided by the user is not very accurate.
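To make the idea of "convergence to the modes" concrete, here is a minimal sketch of the generic mean shift procedure with a flat kernel on numeric points. It is not MeShClust's implementation: the abstract does not say how the algorithm was adapted to sequences, so the 2-D points, the Euclidean distance, the `bandwidth` value, and the function names `mean_shift` and `label_by_mode` are all assumptions made for illustration.

```python
import math


def mean_shift(points, bandwidth, max_iter=100, tol=1e-6):
    """Shift a copy of every point to the mean of its neighbours until it stops moving.

    Generic flat-kernel mean shift (an illustrative assumption, not MeShClust's code).
    Returns one converged position (mode estimate) per input point.
    """
    modes = []
    for p in points:
        current = list(p)
        for _ in range(max_iter):
            # Neighbours are the original points within `bandwidth` of the current position.
            neighbours = [q for q in points if math.dist(current, q) <= bandwidth]
            if not neighbours:  # defensive guard against floating-point edge cases
                break
            shifted = [sum(coord) / len(neighbours) for coord in zip(*neighbours)]
            moved = math.dist(current, shifted)
            current = shifted
            if moved < tol:
                break
        modes.append(current)
    return modes


def label_by_mode(modes, bandwidth):
    """Merge converged positions that lie close together into shared cluster centers."""
    centers, labels = [], []
    for m in modes:
        for i, c in enumerate(centers):
            if math.dist(m, c) <= bandwidth / 2:
                labels.append(i)
                break
        else:
            centers.append(m)
            labels.append(len(centers) - 1)
    return centers, labels


# Two well-separated toy groups (hypothetical data).
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
centers, labels = label_by_mode(mean_shift(points, bandwidth=1.5), bandwidth=1.5)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

In this toy setting the bandwidth plays a role loosely analogous to the similarity parameter discussed above: each point climbs to the nearest density peak rather than being assigned greedily to the first center it matches.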

Year:  2018        PMID: 29718317      PMCID: PMC6101578          DOI: 10.1093/nar/gky315

Source DB:  PubMed          Journal:  Nucleic Acids Res        ISSN: 0305-1048            Impact factor:   16.971


