
A Comparative Analysis Between k-Mers and Community Detection-Based Features for the Task of Protein Classification.

Karthik Tangirala, Nic Herndon, Doina Caragea.   

Abstract

Machine learning algorithms are widely used to annotate biological sequences. Low-dimensional, informative feature vectors can be crucial to the performance of these algorithms. In prior work, we proposed a community detection approach to construct low-dimensional feature sets for nucleotide sequence classification. That approach used the Hamming distance between short nucleotide subsequences, called k-mers, to construct a network, and then applied community detection to identify groups of k-mers that appear frequently in a set of sequences. Although this approach worked well for nucleotide sequence classification, it could not be applied directly to protein sequences, because the Hamming distance is not a good measure for comparing short protein k-mers. To address this limitation, we extended our prior approach by replacing the Hamming distance with substitution scores. Experimental results in different learning scenarios show that the features generated with the new approach are more informative than k-mers.
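The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the linking thresholds, the substitution matrix, or the community detection algorithm, so everything below is an assumption made for illustration. In particular, `TOY_SCORES` holds made-up scores rather than a published matrix, and connected components (via union-find) stand in for a real community detection method. The function names `build_groups` and `featurize` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def kmers(seq, k):
    """Return all overlapping substrings of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def hamming(a, b):
    """Number of positions at which two equal-length k-mers differ."""
    return sum(x != y for x, y in zip(a, b))


# Illustrative scores for a few similar amino-acid pairs (assumption:
# these are toy values, not taken from any published matrix).
TOY_SCORES = {frozenset("LI"): 2, frozenset("KR"): 2, frozenset("DE"): 2}


def substitution_score(a, b, match=4, default=-1):
    """Sum of per-position scores; higher means more similar k-mers."""
    return sum(
        match if x == y else TOY_SCORES.get(frozenset((x, y)), default)
        for x, y in zip(a, b)
    )


def build_groups(sequences, k, linked):
    """Link k-mers for which linked(a, b) is true, then group them.

    Connected components are used here as a simple stand-in for
    community detection. Returns a dict mapping k-mer -> group index.
    """
    vocab = sorted({km for s in sequences for km in kmers(s, k)})
    parent = {km: km for km in vocab}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(vocab, 2):
        if linked(a, b):
            parent[find(a)] = find(b)

    roots = sorted({find(km) for km in vocab})
    index = {root: i for i, root in enumerate(roots)}
    return {km: index[find(km)] for km in vocab}


def featurize(seq, k, groups):
    """Count how many k-mers of seq fall into each group."""
    counts = Counter(groups[km] for km in kmers(seq, k) if km in groups)
    n_groups = len(set(groups.values()))
    return [counts.get(i, 0) for i in range(n_groups)]


# Nucleotide case: link k-mers at Hamming distance <= 1 (assumed threshold).
dna = ["ACGT", "ACGA", "TTTT"]
dna_groups = build_groups(dna, 3, lambda a, b: hamming(a, b) <= 1)
dna_features = [featurize(s, 3, dna_groups) for s in dna]

# Protein case: link k-mers whose toy substitution score is >= 4 (assumed).
prot = ["LKD", "IRE"]
prot_groups = build_groups(prot, 2, lambda a, b: substitution_score(a, b) >= 4)
```

The only difference between the two cases is the `linked` predicate, which mirrors the change the abstract describes: the protein k-mers `LK` and `IR` differ at every position under the Hamming distance, yet a substitution-based score can still treat them as similar. The resulting feature vector has one entry per group rather than one per k-mer, which is what keeps it low-dimensional.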

Year: 2016
PMID: 26863669
PMCID: PMC6245644
DOI: 10.1109/TNB.2016.2523501

Source DB: PubMed
Journal: IEEE Trans Nanobioscience
ISSN: 1536-1241
Impact factor: 2.935


