Feature selection and the class imbalance problem in predicting protein function from sequence.

Ali Al-Shahib, Rainer Breitling, David Gilbert.

Abstract

When the standard approach of predicting protein function by sequence homology fails, alternative methods that require only the amino acid sequence can be used. One such approach uses machine learning to predict protein function directly from amino acid sequence features. However, two issues must be addressed before functional prediction can succeed: identifying discriminatory features, and overcoming a large class imbalance in the training data. We show that applying feature subset selection followed by undersampling of the majority class generates significantly better support vector machine (SVM) classifiers than standard machine learning approaches. The selected features have the potential to advance our understanding of the relationship between sequence and function, and undersampling to produce fully balanced data significantly improves performance. The best discriminating ability is achieved using SVMs together with feature selection and full undersampling; this approach strongly outperforms other competitive learning algorithms. We conclude that this combined approach can generate powerful machine learning classifiers for predicting protein function directly from sequence.
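The two preprocessing steps named in the abstract, feature subset selection followed by undersampling of the majority class, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not say which selection method was used, so a simple mean-difference filter score stands in for it, and the function names `select_features` and `undersample_majority` and the toy data are hypothetical. The SVM training step is left out; in practice it would come from a machine learning library and be applied to the balanced output.

```python
import random
from statistics import mean, pstdev


def select_features(X, y, k):
    """Return the indices of the k features with the highest filter score.

    Score = |mean(pos) - mean(neg)| / (std(pos) + std(neg) + eps).
    Assumption: this filter is only a stand-in for a real subset-selection method.
    """
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    scores = []
    for j in range(len(X[0])):
        p = [row[j] for row in pos]
        n = [row[j] for row in neg]
        score = abs(mean(p) - mean(n)) / (pstdev(p) + pstdev(n) + 1e-9)
        scores.append((score, j))
    top = sorted(scores, reverse=True)[:k]
    return sorted(j for _, j in top)


def undersample_majority(X, y, seed=0):
    """Randomly drop rows of the larger class until all classes are equal in size."""
    rng = random.Random(seed)
    idx_by_class = {}
    for i, label in enumerate(y):
        idx_by_class.setdefault(label, []).append(i)
    n_min = min(len(indices) for indices in idx_by_class.values())
    keep = []
    for indices in idx_by_class.values():
        keep.extend(rng.sample(indices, n_min))
    keep.sort()
    return [X[i] for i in keep], [y[i] for i in keep]


# Hypothetical toy data: 90 negatives, 10 positives, 2 features.
# Feature 0 separates the classes; feature 1 does not.
X, y = [], []
for i in range(90):
    X.append([(i % 5) * 0.1, (i % 7) * 0.1])
    y.append(0)
for i in range(10):
    X.append([5.0 + (i % 5) * 0.1, (i % 7) * 0.1])
    y.append(1)

# Step 1: feature selection on the full data; step 2: full undersampling.
selected = select_features(X, y, k=1)
X_sel = [[row[j] for j in selected] for row in X]
X_bal, y_bal = undersample_majority(X_sel, y)

print("selected feature indices:", selected)
print("class counts after undersampling:", y_bal.count(0), y_bal.count(1))
```

The balanced pair `X_bal`, `y_bal` is what a classifier would then be trained on. Only the training data should be undersampled; a held-out test set would normally keep its original class distribution so that reported performance reflects the real imbalance.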

Year: 2005; PMID: 16231961; DOI: 10.2165/00822942-200504030-00004

Source DB: PubMed; Journal: Appl Bioinformatics; ISSN: 1175-5636


