Jiawei Wang1, Jiahui Li1,2, Bingjiao Yang3, Ruopeng Xie3, Tatiana T Marquez-Lago4,5, André Leier4,5, Morihiro Hayashida6, Tatsuya Akutsu7, Yanju Zhang3, Kuo-Chen Chou8,9,10, Joel Selkrig11, Tieli Zhou2, Jiangning Song12,13,14, Trevor Lithgow1.
1. Infection and Immunity Program, Biomedicine Discovery Institute and Department of Microbiology, Monash University, Melbourne, VIC, Australia.
2. Department of Clinical Laboratory, The First Affiliated Hospital of Wenzhou Medical University, Wenzhou, China.
3. School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin, China.
4. Department of Genetics, School of Medicine, University of Alabama at Birmingham, AL, USA.
5. Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, AL, USA.
6. National Institute of Technology, Matsue College, Matsue, Shimane, Japan.
7. Bioinformatics Center, Institute for Chemical Research, Kyoto University, Kyoto, Japan.
8. Gordon Life Science Institute, Boston, MA, USA.
9. Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China.
10. Center of Excellence in Genomic Medicine Research (CEGMR), King Abdulaziz University, Jeddah, Saudi Arabia.
11. European Molecular Biology Laboratory, Genome Biology Unit, Heidelberg, Germany.
12. Infection and Immunity Program, Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC, Australia.
13. Monash Centre for Data Science, Monash University, Melbourne, VIC, Australia.
14. ARC Centre of Excellence in Advanced Molecular Imaging, Monash University, Melbourne, VIC, Australia.
Abstract
MOTIVATION: Type III secreted effectors (T3SEs) can be injected into the host cell cytoplasm via type III secretion systems (T3SSs) to modulate interactions between Gram-negative bacterial pathogens and their hosts. Because of their relevance to pathogen-host interactions, significant computational effort has been devoted to identifying T3SEs, and this in turn has stimulated new T3SE discoveries. However, as T3SEs with new characteristics are discovered, the existing computational tools reveal important limitations: (i) most of the trained machine learning models are based on the N-terminus (in some cases also incorporating the C-terminus) rather than on the proteins' complete sequences, and (ii) the underlying models (trained with classic algorithms) employ only a few features, most of which are extracted from sequence information alone. To achieve better T3SE prediction, we must identify more powerful, informative features and investigate how to integrate them effectively into a comprehensive model. RESULTS: In this work, we present Bastion3, a two-layer ensemble predictor developed to accurately identify type III secreted effectors from protein sequence data. In contrast with existing methods that employ single models with few features, Bastion3 explores a wide range of features of various types, trains single models on these features and finally integrates these models through ensemble learning. We trained the models using a new gradient boosting machine, LightGBM, and further boosted the models' performance through a novel genetic algorithm (GA) based two-step parameter optimization strategy. Our benchmark test demonstrates that Bastion3 performs much better than commonly used methods, with an ACC value of 0.959, F-value of 0.958, MCC value of 0.917 and AUC value of 0.956, outperforming all other toolkits by more than 5.6% in ACC value, 5.7% in F-value, 12.4% in MCC value and 5.8% in AUC value.
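For readers less familiar with the four reported measures, the following minimal Python sketch shows how ACC, F-value, MCC and AUC are conventionally computed for a binary classifier. It assumes the F-value is the F1 score (harmonic mean of precision and recall), which the abstract does not state explicitly. The function names and any counts passed to them are illustrative only and are not taken from the Bastion3 benchmark.

```python
import math
from typing import Dict, List


def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute ACC, F-value (assumed to be F1) and MCC from confusion-matrix counts."""
    total = tp + fp + tn + fn
    acc = (tp + tn) / total if total else 0.0

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Assumption: "F-value" means F1, the harmonic mean of precision and recall.
    f_value = (2 * precision * recall / (precision + recall)
               if (precision + recall) else 0.0)

    # Matthews correlation coefficient; defined as 0 when the denominator vanishes.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {"ACC": acc, "F": f_value, "MCC": mcc}


def auc(pos_scores: List[float], neg_scores: List[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. This pairwise form is quadratic in the
    number of samples, which is fine for a small illustration.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative score")
    correct = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos_scores) * len(neg_scores))
```

MCC uses all four cells of the confusion matrix, so it can differ more between tools than ACC or the F-value do on the same predictions.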
Based on our proposed two-layer ensemble model, we further developed a user-friendly online toolkit to make T3SE prediction as convenient as possible for experimental scientists. Designed to ease future discovery of novel T3SEs and offering improved performance, Bastion3 is poised to become a widely used, state-of-the-art toolkit for T3SE prediction. AVAILABILITY AND IMPLEMENTATION: http://bastion3.erc.monash.edu/. CONTACT: selkrig@embl.de or wyztli@163.com or trevor.lithgow@monash.edu. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
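The two-layer idea described in the abstract (one model per feature type, then an integration step over those models) can be illustrated with a short Python sketch. This is not the Bastion3 implementation: the abstract does not say how the second layer combines the first-layer outputs, so the sketch assumes simple score averaging (soft voting) with a 0.5 threshold. The two feature encoders, the 50-residue window and the `Model` callables are hypothetical stand-ins for trained LightGBM models.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for a trained first-layer model:
# maps a feature vector to a score in [0, 1].
Model = Callable[[List[float]], float]
Encoder = Callable[[str], List[float]]

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_composition(seq: str) -> List[float]:
    """Fraction of each of the 20 standard amino acids in the sequence."""
    seq = seq.upper()
    n = max(len(seq), 1)  # avoid division by zero on empty input
    return [seq.count(a) / n for a in AMINO_ACIDS]


def nterm_composition(seq: str, window: int = 50) -> List[float]:
    """Composition of the first `window` residues (window size is an assumption)."""
    return aa_composition(seq[:window])


def ensemble_predict(
    seq: str,
    encoders: Dict[str, Encoder],
    models: Dict[str, Model],
    threshold: float = 0.5,
) -> Tuple[float, bool]:
    """Layer 1: score the sequence with one model per feature type.
    Layer 2 (assumed here): average the scores and apply a threshold."""
    if not encoders:
        raise ValueError("at least one feature encoder is required")
    scores = [models[name](encoders[name](seq)) for name in encoders]
    mean_score = sum(scores) / len(scores)
    return mean_score, mean_score >= threshold
```

In practice each entry of `models` would wrap a classifier trained on the corresponding feature type, and the second layer could just as well be a weighted vote or a further learned model; averaging is used here only because it is the simplest integration rule to read.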