
Unsupervised learning assisted robust prediction of bioluminescent proteins.

Abhigyan Nath, Karthikeyan Subbiah.

Abstract

Bioluminescence plays an important role in nature; for example, it is used for intracellular chemical signalling in bacteria. It is also used as a reagent in analytical research methods ranging from cellular imaging to gene expression analysis. However, identifying and annotating bioluminescent proteins is difficult because they share little sequence similarity. In this paper, we present a novel approach for within-class and between-class balancing, as well as diversification, of a training dataset by combining the unsupervised K-Means algorithm with the Synthetic Minority Oversampling Technique (SMOTE) in order to obtain the true performance of the prediction model. We further varied the ratio of positive to negative data in the training dataset to probe for an optimal class distribution that produces the best prediction accuracy. The appropriately balanced and diversified training set resulted in near-complete learning with greater generalization on the blind test datasets. The results strongly support the view that an optimal class distribution with a high degree of diversity is essential for achieving near-perfect learning. Using random forests as the weak learners in boosting, trained on the optimally balanced and diversified training dataset, we achieved an overall accuracy of 95.3% in tenfold cross-validation, and an accuracy of 91.7%, sensitivity of 89.3% and specificity of 91.8% on a holdout test set. The general framework discussed in this work may also be applicable to other biological datasets affected by class imbalance and incomplete learning.
Copyright © 2015 Elsevier Ltd. All rights reserved.

Keywords:  Class imbalance; K-Means; Optimal class distribution; SMOTE; Training set diversity
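
The abstract does not spell out how K-Means and SMOTE are combined, so the following is only a minimal sketch of one plausible reading, not the authors' implementation: cluster the minority class, then generate SMOTE-style interpolated samples inside each cluster so that small clusters are topped up toward an equal share (within-class balancing) while the class as a whole grows toward a chosen fraction of the majority size (between-class balancing). All function names, parameters and the toy data are hypothetical, and only the Python standard library is used.

```python
import random
from math import dist

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster index per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster empties
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

def smote_in_cluster(members, n_new, k_neighbors, rng):
    """SMOTE-style interpolation restricted to the points of one cluster."""
    if len(members) < 2:
        return [members[0]] * n_new  # nothing to interpolate with: duplicate
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(members))
        base = members[i]
        others = members[:i] + members[i + 1:]
        neighbors = sorted(others, key=lambda m: dist(base, m))[:k_neighbors]
        mate = rng.choice(neighbors)
        gap = rng.random()  # position on the segment from base to mate
        synthetic.append(tuple(b + gap * (m - b) for b, m in zip(base, mate)))
    return synthetic

def balance_minority(minority, n_majority, ratio=1.0, k=3, k_neighbors=5, seed=0):
    """Oversample the minority class toward ratio * n_majority samples,
    topping up each cluster toward an equal share of that target."""
    rng = random.Random(seed)
    labels = kmeans(minority, k, seed=seed)
    clusters = [[p for p, lab in zip(minority, labels) if lab == c] for c in range(k)]
    clusters = [c for c in clusters if c]  # drop empty clusters
    target_total = max(len(minority), round(ratio * n_majority))
    per_cluster = target_total // len(clusters)
    balanced = list(minority)  # original samples are always kept
    for members in clusters:
        n_new = max(0, per_cluster - len(members))
        balanced.extend(smote_in_cluster(members, n_new, k_neighbors, rng))
    return balanced

# Toy data (assumed): a large and a small minority sub-group in 2-D.
demo_rng = random.Random(42)
minority = [(demo_rng.gauss(0, 1), demo_rng.gauss(0, 1)) for _ in range(12)]
minority += [(demo_rng.gauss(6, 1), demo_rng.gauss(6, 1)) for _ in range(4)]
balanced = balance_minority(minority, n_majority=60, ratio=0.5, k=2)
print(len(minority), len(balanced))
```

Calling `balance_minority` with several values of `ratio` and comparing cross-validated accuracy would mirror the abstract's search for an optimal class distribution; the classifier itself (boosting with random forests as weak learners) is left out of this sketch.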

Year:  2015        PMID: 26599828     DOI: 10.1016/j.compbiomed.2015.10.013

Source DB:  PubMed          Journal:  Comput Biol Med        ISSN: 0010-4825            Impact factor:   4.589


  3 in total

1.  Probing an optimal class distribution for enhancing prediction and feature characterization of plant virus-encoded RNA-silencing suppressors.

Authors:  Abhigyan Nath; Karthikeyan Subbiah
Journal:  3 Biotech       Date:  2016-03-21       Impact factor: 2.406

2.  Prediction of bioluminescent proteins by using sequence-derived features and lineage-specific scheme.

Authors:  Jian Zhang; Haiting Chai; Guifu Yang; Zhiqiang Ma
Journal:  BMC Bioinformatics       Date:  2017-06-05       Impact factor: 3.169

3.  iBLP: An XGBoost-Based Predictor for Identifying Bioluminescent Proteins.

Authors:  Dan Zhang; Hua-Dong Chen; Hasan Zulfiqar; Shi-Shi Yuan; Qin-Lai Huang; Zhao-Yue Zhang; Ke-Jun Deng
Journal:  Comput Math Methods Med       Date:  2021-01-07       Impact factor: 2.238

