
n-gram-based classification and unsupervised hierarchical clustering of genome sequences.

Andrija Tomović, Predrag Janicić, Vlado Keselj.

Abstract

In this paper we address the problem of automated classification of isolates, i.e., the problem of determining the family of genomes to which a given genome belongs. Additionally, we address the problem of automated unsupervised hierarchical clustering of isolates according only to their statistical substring properties. For both of these problems we present novel algorithms based on nucleotide n-grams, with no required preprocessing steps such as sequence alignment. The experimental results are very positive and suggest that the proposed techniques can be successfully used in a variety of related problems. The reported experiments demonstrate better performance than some of the state-of-the-art methods. We report on a new distance measure between n-gram profiles, which shows superior performance compared to many other measures, including the commonly used Euclidean distance.
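
To make the idea of an n-gram profile concrete, here is a minimal Python sketch. It is not the authors' algorithm: the abstract does not define the new distance measure, so the sketch uses the Euclidean distance that the abstract names as a common baseline. The function names (`ngram_profile`, `euclidean_distance`, `classify`), the default n = 3, and the use of relative frequencies as the profile are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def ngram_profile(seq, n=3):
    """Return relative frequencies of overlapping nucleotide n-grams in seq.

    Assumption: a profile is a plain dict mapping n-gram -> relative frequency.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


def euclidean_distance(p, q):
    """Euclidean distance between two profiles; missing n-grams count as 0."""
    grams = set(p) | set(q)
    return sqrt(sum((p.get(g, 0.0) - q.get(g, 0.0)) ** 2 for g in grams))


def classify(seq, family_profiles, n=3):
    """Assign seq to the family whose profile is nearest (baseline distance)."""
    profile = ngram_profile(seq, n)
    return min(
        family_profiles,
        key=lambda name: euclidean_distance(profile, family_profiles[name]),
    )


if __name__ == "__main__":
    # Toy sequences, invented for illustration only.
    families = {
        "a_rich": ngram_profile("AAAAAAAATAAAA", n=2),
        "gc_rich": ngram_profile("GCGCGGCCGCGC", n=2),
    }
    print(classify("AAAATAAA", families, n=2))
```

No alignment step is needed: each sequence is reduced to a frequency vector, and classification is a nearest-profile lookup. Swapping in a different distance function only requires changing the `key` used in `classify`.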

Year:  2006        PMID: 16423423     DOI: 10.1016/j.cmpb.2005.11.007

Source DB:  PubMed          Journal:  Comput Methods Programs Biomed        ISSN: 0169-2607            Impact factor:   5.428


  10 in total

1.  DeepDeath: Learning to predict the underlying cause of death with Big Data.

Authors:  Hamid Reza Hassanzadeh; May D Wang
Journal:  Conf Proc IEEE Eng Med Biol Soc       Date:  2017-07

2.  CRISPRclassify: Repeat-Based Classification of CRISPR Loci.

Authors:  Matthew A Nethery; Michael Korvink; Kira S Makarova; Yuri I Wolf; Eugene V Koonin; Rodolphe Barrangou
Journal:  CRISPR J       Date:  2021-08

3.  Genome classification by gene distribution: an overlapping subspace clustering approach.

Authors:  Jason Li; Saman K Halgamuge; Sen-Lin Tang
Journal:  BMC Evol Biol       Date:  2008-04-23       Impact factor: 3.260

4.  N-gram analysis of 970 microbial organisms reveals presence of biological language models.

Authors:  Hatice Ulku Osmanbeyoglu; Madhavi K Ganapathiraju
Journal:  BMC Bioinformatics       Date:  2011-01-10       Impact factor: 3.169

5.  Effective computational detection of piRNAs using n-gram models and support vector machine.

Authors:  Chun-Chi Chen; Xiaoning Qian; Byung-Jun Yoon
Journal:  BMC Bioinformatics       Date:  2017-12-28       Impact factor: 3.169

6.  CFSP: a collaborative frequent sequence pattern discovery algorithm for nucleic acid sequence classification.

Authors:  He Peng
Journal:  PeerJ       Date:  2020-04-20       Impact factor: 2.984

7.  Reaction classification and yield prediction using the differential reaction fingerprint DRFP.

Authors:  Daniel Probst; Philippe Schwaller; Jean-Louis Reymond
Journal:  Digit Discov       Date:  2022-01-21

8.  A new systematic computational approach to predicting target genes of transcription factors.

Authors:  Xinbin Dai; Ji He; Xuechun Zhao
Journal:  Nucleic Acids Res       Date:  2007-06-18       Impact factor: 16.971

9.  n-Gram characterization of genomic islands in bacterial genomes.

Authors:  Gordana M Pavlović-Lazetić; Nenad S Mitić; Milos V Beljanski
Journal:  Comput Methods Programs Biomed       Date:  2008-12-19       Impact factor: 5.428

10.  On the Verge of Life: Distribution of Nucleotide Sequences in Viral RNAs.

Authors:  Mykola Husev; Andrij Rovenchak
Journal:  Biosemiotics       Date:  2021-02-17       Impact factor: 0.711

