Protein classification based on text document classification techniques.

Betty Yee Man Cheng, Jaime G Carbonell, Judith Klein-Seetharaman.

Abstract

The need for accurate, automated protein classification methods continues to increase as advances in biotechnology uncover new proteins. G-protein coupled receptors (GPCRs) are a particularly difficult superfamily of proteins to classify because of the extreme diversity among its members. Previous comparisons of BLAST, k-nearest neighbor (k-NN), hidden Markov model (HMM) and support vector machine (SVM) classifiers using alignment-based features have suggested that classifiers at the complexity of SVM are needed to attain high accuracy. Here, by analogy with document classification, we applied Decision Tree and Naive Bayes classifiers with chi-square feature selection on counts of n-grams (i.e., short peptide sequences of length n) to this classification task. Using the GPCR dataset and evaluation protocol from the previous study, the Naive Bayes classifier attained accuracies of 93.0 and 92.4% in level I and level II subfamily classification, respectively, while SVM has reported accuracies of 88.4 and 86.3%. This is a 39.7 and 44.5% reduction in residual error for level I and level II subfamily classification, respectively. The Decision Tree, while inferior to SVM, outperforms HMM in both level I and level II subfamily classification. For those GPCR families whose profiles are stored in the Protein FAMilies database of alignments and HMMs (PFAM), our method performs comparably to a search against those profiles. Finally, the method generalizes to other protein families: applied to the superfamily of nuclear receptors, it attained 94.5, 97.8 and 93.6% accuracy in family, level I and level II subfamily classification, respectively. Copyright 2005 Wiley-Liss, Inc.
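
The approach described in the abstract, and the residual-error figures it quotes, can be illustrated with a minimal sketch. This is not the authors' implementation: it omits the chi-square feature selection step, uses a plain multinomial Naive Bayes with add-one smoothing as an assumed formulation, and runs on invented toy sequences and labels. All function names (`ngram_counts`, `train_naive_bayes`, `classify`, `residual_error_reduction`) are hypothetical.

```python
import math
from collections import Counter, defaultdict


def ngram_counts(seq, n):
    """Count overlapping n-grams (peptides of length n) in one sequence."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def train_naive_bayes(examples, n=2, alpha=1.0):
    """Fit a multinomial Naive Bayes model on n-gram counts.

    examples: list of (sequence, label) pairs.
    alpha: additive smoothing constant (an assumption, not from the abstract).
    """
    class_totals = Counter()
    feature_counts = defaultdict(Counter)
    vocab = set()
    for seq, label in examples:
        class_totals[label] += 1
        counts = ngram_counts(seq, n)
        feature_counts[label].update(counts)
        vocab.update(counts)
    n_examples = sum(class_totals.values())
    log_prior = {c: math.log(k / n_examples) for c, k in class_totals.items()}
    log_like = {}
    for c in class_totals:
        denom = sum(feature_counts[c].values()) + alpha * len(vocab)
        log_like[c] = {
            g: math.log((feature_counts[c][g] + alpha) / denom) for g in vocab
        }
    return log_prior, log_like, vocab, n


def classify(model, seq):
    """Return the label with the highest posterior log-score for seq."""
    log_prior, log_like, vocab, n = model
    counts = ngram_counts(seq, n)
    scores = {}
    for c in log_prior:
        # n-grams never seen in training are ignored.
        scores[c] = log_prior[c] + sum(
            k * log_like[c][g] for g, k in counts.items() if g in vocab
        )
    return max(scores, key=scores.get)


def residual_error_reduction(baseline_acc, new_acc):
    """Percent reduction in error (100 - accuracy) relative to the baseline."""
    return 100.0 * (new_acc - baseline_acc) / (100.0 - baseline_acc)


# Invented toy data; real GPCR sequences are far longer and more diverse.
toy = [("AAAAGAAA", "A"), ("AAGAAAAA", "A"),
       ("LLLLVLLL", "B"), ("LVLLLLLL", "B")]
model = train_naive_bayes(toy, n=2)
print(classify(model, "AAAGAA"))
print(round(residual_error_reduction(88.4, 93.0), 1))
print(round(residual_error_reduction(86.3, 92.4), 1))
```

The last two calls recompute the quoted reductions from the reported accuracies: (93.0 - 88.4) / (100 - 88.4) and (92.4 - 86.3) / (100 - 86.3).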

Year:  2005        PMID: 15645499     DOI: 10.1002/prot.20373

Source DB:  PubMed          Journal:  Proteins        ISSN: 0887-3585


