
PeNGaRoo, a combined gradient boosting and ensemble learning framework for predicting non-classical secreted proteins.

Yanju Zhang, Sha Yu, Ruopeng Xie, Jiahui Li, André Leier, Tatiana T Marquez-Lago, Tatsuya Akutsu, A Ian Smith, Zongyuan Ge, Jiawei Wang, Trevor Lithgow, Jiangning Song.

Abstract

MOTIVATION: Gram-positive bacteria have developed secretion systems to transport proteins across their cell wall, a process that plays an important role during host infection. These secretion mechanisms have also been harnessed for therapeutic purposes in many biotechnology applications. Accordingly, identifying the features that select a protein for efficient secretion from these microorganisms has become an important task. Among all secreted proteins, 'non-classical' secreted proteins are difficult to identify because they lack discernible signal peptide sequences and can make use of diverse secretion pathways. Several computational methods have been developed to facilitate the discovery of such non-classical secreted proteins; however, the existing methods are based on either simulated or limited experimental datasets. In addition, they often train their models on basic features in a simple, coarse-grained manner. The availability of more experimentally validated datasets, advanced feature engineering techniques and novel machine learning approaches creates new opportunities for the development of improved predictors of 'non-classical' secreted proteins from sequence data.
RESULTS: In this work, we first constructed a high-quality dataset of experimentally verified 'non-classical' secreted proteins, which we then used to create benchmark datasets. Using these benchmark datasets, we comprehensively analyzed a wide range of features and assessed their individual performance. Subsequently, we developed a two-layer Light Gradient Boosting Machine (LightGBM) ensemble model that integrates several single feature-based models into an overall prediction framework. In the first layer, LightGBM, a gradient boosting machine, was used as the machine learning approach, and the necessary parameter optimization was performed by a particle swarm optimization strategy. All single feature-based LightGBM models were then integrated into a unified ensemble model to further improve the predictive performance. The final ensemble model achieved an accuracy of 0.900, an F-value of 0.903, a Matthews correlation coefficient of 0.803 and an area under the curve value of 0.963, outperforming previous state-of-the-art predictors on the independent test. Based on our proposed optimal ensemble model, we further developed an accessible online predictor, PeNGaRoo, to serve users' demands. We believe this online web server, together with our proposed methodology, will expedite the discovery of non-classically secreted effector proteins in Gram-positive bacteria and further inspire the development of next-generation predictors.
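The four reported metrics and the idea of combining single feature-based models can be illustrated with a minimal sketch. It makes several assumptions that are not taken from the abstract: the per-model probabilities are hard-coded toy values standing in for the outputs of trained LightGBM models, the models are combined by simple averaging (the abstract does not state how the ensemble layer integrates them), and a threshold of 0.5 is used. All function and variable names are hypothetical.

```python
import math


def ensemble_scores(per_model_scores):
    """Average the probabilities of several single feature-based models.

    Assumption: plain averaging. The actual integration rule used by
    PeNGaRoo is not given in the abstract.
    """
    n_models = len(per_model_scores)
    return [sum(column) / n_models for column in zip(*per_model_scores)]


def binary_metrics(y_true, y_score, threshold=0.5):
    """Return accuracy, F-value (F1) and Matthews correlation coefficient."""
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    f_denominator = 2 * tp + fp + fn
    f_value = 2 * tp / f_denominator if f_denominator else 0.0
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0
    return {"accuracy": accuracy, "f_value": f_value, "mcc": mcc}


def auc(y_true, y_score):
    """Area under the ROC curve as the fraction of correctly ranked
    (positive, negative) pairs; ties count as half."""
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Toy data: 1 = non-classical secreted protein, 0 = other protein.
y_true = [1, 1, 1, 0, 0, 0]
# Hypothetical probabilities from two single feature-based models.
model_a = [0.9, 0.6, 0.4, 0.3, 0.2, 0.9]
model_b = [0.8, 0.7, 0.7, 0.1, 0.5, 0.3]

combined = ensemble_scores([model_a, model_b])
results = binary_metrics(y_true, combined)
results["auc"] = auc(y_true, combined)
print(results)
```

In this toy case the averaged scores misclassify one negative example, so the accuracy, F-value and Matthews correlation coefficient all fall below 1 while the area under the curve reflects the single mis-ranked pair. In practice each probability list would come from a model trained on one feature encoding.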
AVAILABILITY AND IMPLEMENTATION: http://pengaroo.erc.monash.edu/.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
© The Author(s) 2019. Published by Oxford University Press. All rights reserved. For permissions, please e-mail: journals.permissions@oup.com.

Year:  2020        PMID: 31393553     DOI: 10.1093/bioinformatics/btz629

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


