Classification and knowledge discovery in protein databases.

Predrag Radivojac, Nitesh V Chawla, A Keith Dunker, Zoran Obradovic.

Abstract

We consider the problem of classification in noisy, high-dimensional, and class-imbalanced protein datasets. In order to design a complete classification system, we use a three-stage machine learning framework consisting of a feature selection stage, a method addressing noise and class-imbalance, and a method for combining biologically related tasks through a prior-knowledge based clustering. In the first stage, we employ Fisher's permutation test as a feature selection filter. Comparisons with the alternative criteria show that it may be favorable for typical protein datasets. In the second stage, noise and class imbalance are addressed by using minority class over-sampling, majority class under-sampling, and ensemble learning. The performance of logistic regression models, decision trees, and neural networks is systematically evaluated. The experimental results show that in many cases ensembles of logistic regression classifiers may outperform more expressive models due to their robustness to noise and low sample density in a high-dimensional feature space. However, ensembles of neural networks may be the best solution for large datasets. In the third stage, we use prior knowledge to partition unlabeled data such that the class distributions among non-overlapping clusters significantly differ. In our experiments, training classifiers specialized to the class distributions of each cluster resulted in a further decrease in classification error.
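The first-stage filter described in the abstract can be illustrated with a minimal sketch of a two-sample permutation test applied one feature at a time. This is not the authors' implementation: the test statistic (absolute difference in class means), the cutoff `alpha`, the number of permutations, and the function names `permutation_p_value` and `select_features` are all assumptions made for illustration.

```python
import random
from statistics import mean


def permutation_p_value(values, labels, n_permutations=2000, seed=0):
    """Estimate a two-sided permutation p-value for one feature.

    Assumed statistic: absolute difference between the class means.
    `labels` holds 0/1 class labels and must contain both classes.
    """
    rng = random.Random(seed)

    def mean_diff(lbls):
        pos = [v for v, y in zip(values, lbls) if y == 1]
        neg = [v for v, y in zip(values, lbls) if y == 0]
        return abs(mean(pos) - mean(neg))

    observed = mean_diff(labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        # Shuffling labels keeps class sizes fixed and breaks any
        # association between the feature and the class.
        rng.shuffle(shuffled)
        if mean_diff(shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_permutations + 1)


def select_features(rows, labels, alpha=0.05, **kwargs):
    """Return indices of columns whose p-value falls below alpha (assumed cutoff)."""
    kept = []
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        if permutation_p_value(column, labels, **kwargs) < alpha:
            kept.append(j)
    return kept
```

Because the filter scores each feature independently of any classifier, it can be run once before the sampling and ensemble stage. In practice the cutoff would likely need a multiple-testing adjustment when thousands of features are screened.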

Year:  2004        PMID: 15465476     DOI: 10.1016/j.jbi.2004.07.008

Source DB:  PubMed          Journal:  J Biomed Inform        ISSN: 1532-0464            Impact factor:   6.317


  8 in total

1.  Imbalanced class learning in epigenetics.

Authors:  M Muksitul Haque; Michael K Skinner; Lawrence B Holder
Journal:  J Comput Biol       Date:  2014-05-05       Impact factor: 1.479

2.  Intrinsic disorder and functional proteomics. (Review)

Authors:  Predrag Radivojac; Lilia M Iakoucheva; Christopher J Oldfield; Zoran Obradovic; Vladimir N Uversky; A Keith Dunker
Journal:  Biophys J       Date:  2006-12-08       Impact factor: 4.033

3.  Analysis of structured and intrinsically disordered regions of transmembrane proteins.

Authors:  Bin Xue; Liwei Li; Samy O Meroueh; Vladimir N Uversky; A Keith Dunker
Journal:  Mol Biosyst       Date:  2009-12

4.  SMOTE for high-dimensional class-imbalanced data.

Authors:  Rok Blagus; Lara Lusa
Journal:  BMC Bioinformatics       Date:  2013-03-22       Impact factor: 3.169

5.  Predicting protein disorder by analyzing amino acid sequence.

Authors:  Jack Y Yang; Mary Qu Yang
Journal:  BMC Genomics       Date:  2008-09-16       Impact factor: 3.969

6.  Joint use of over- and under-sampling techniques and cross-validation for the development and assessment of prediction models.

Authors:  Rok Blagus; Lara Lusa
Journal:  BMC Bioinformatics       Date:  2015-11-04       Impact factor: 3.169

7.  Iterative nearest neighborhood oversampling in semisupervised learning from imbalanced data.

Authors:  Fengqi Li; Chuang Yu; Nanhai Yang; Feng Xia; Guangming Li; Fatemeh Kaveh-Yazdy
Journal:  ScientificWorldJournal       Date:  2013-07-10

8.  Global Phosphoproteomic Analysis Reveals the Involvement of Phosphorylation in Aflatoxins Biosynthesis in the Pathogenic Fungus Aspergillus flavus.

Authors:  Silin Ren; Mingkun Yang; Yu Li; Feng Zhang; Zhuo Chen; Jia Zhang; Guang Yang; Yuewei Yue; Siting Li; Feng Ge; Shihua Wang
Journal:  Sci Rep       Date:  2016-09-26       Impact factor: 4.379
