Do we need different machine learning algorithms for QSAR modeling? A comprehensive assessment of 16 machine learning algorithms on 14 QSAR data sets.

Zhenxing Wu, Minfeng Zhu, Yu Kang, Elaine Lai-Han Leung, Tailong Lei, Chao Shen, Dejun Jiang, Zhe Wang, Dongsheng Cao, Tingjun Hou.

Abstract

Although a wide variety of machine learning (ML) algorithms have been used to learn quantitative structure-activity relationships (QSARs), there is no agreed single best algorithm for QSAR learning. A comprehensive understanding of the performance characteristics of popular ML algorithms used in QSAR learning is therefore highly desirable. In this study, five linear algorithms [linear function Gaussian process regression (linear-GPR), linear function support vector machine (linear-SVM), partial least squares regression (PLSR), multiple linear regression (MLR) and principal component regression (PCR)], three analogizers [radial basis function support vector machine (rbf-SVM), K-nearest neighbor (KNN) and radial basis function Gaussian process regression (rbf-GPR)], six symbolists [extreme gradient boosting (XGBoost), Cubist, random forest (RF), multivariate adaptive regression splines (MARS), gradient boosting machine (GBM) and classification and regression tree (CART)] and two connectionists [principal component analysis artificial neural network (pca-ANN) and deep neural network (DNN)] were used to build regression-based QSAR models for 14 public data sets comprising nine physicochemical properties and five toxicity endpoints. The results show that rbf-SVM, rbf-GPR, XGBoost and DNN generally perform better than the other algorithms. The overall performance of the algorithms can be ranked from best to worst as follows: rbf-SVM > XGBoost > rbf-GPR > Cubist > GBM > DNN > RF > pca-ANN > MARS > linear-GPR ≈ KNN > linear-SVM ≈ PLSR > CART ≈ PCR ≈ MLR. In terms of prediction accuracy and computational efficiency, SVM and XGBoost are recommended for regression learning on small data sets, and XGBoost is an excellent choice for large data sets. We then investigated the performance of ensemble models that integrate the predictions of multiple ML algorithms. The results show that ensembles of two or three algorithms from different categories can indeed improve on the predictions of the best individual ML algorithms.
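The abstract does not say how the predictions of the individual algorithms were integrated. As a minimal sketch of the general idea only, the Python example below assumes simple unweighted averaging of two models' predictions and compares root-mean-square error (RMSE) before and after. The function names, the model labels and all numbers are hypothetical and chosen for illustration; none of them are results from the study.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root-mean-square error between observed and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(squared_errors))


def average_ensemble(*model_predictions):
    """Unweighted mean of several models' predictions, compound by compound.

    Assumption: plain averaging; the study may have combined models differently.
    """
    return [sum(values) / len(values) for values in zip(*model_predictions)]


# Made-up observed values and predictions (not data from the paper).
y_true = [1.0, 2.0, 3.0, 4.0]
pred_model_a = [1.4, 1.8, 3.5, 3.7]  # hypothetical model from one category
pred_model_b = [0.8, 2.4, 2.7, 4.5]  # hypothetical model from another category

pred_ensemble = average_ensemble(pred_model_a, pred_model_b)

print("RMSE model A: ", rmse(y_true, pred_model_a))
print("RMSE model B: ", rmse(y_true, pred_model_b))
print("RMSE ensemble:", rmse(y_true, pred_ensemble))
```

In this toy case the two models err in opposite directions on each compound, so averaging cancels part of the error. That is one plausible reason why combining algorithms from different categories could help, although the abstract itself does not give the mechanism.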
© The Author(s) 2020. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oup.com.

Keywords:  QSAR; XGBoost; ensemble learning; machine learning; support vector machine

Year:  2021        PMID: 33313673     DOI: 10.1093/bib/bbaa321

Source DB:  PubMed          Journal:  Brief Bioinform        ISSN: 1467-5463            Impact factor:   11.622


  7 in total

1.  A Discovery Strategy for Active Compounds of Chinese Medicine Based on the Prediction Model of Compound-Disease Relationship.

Authors:  Mengqi Huo; Sha Peng; Jing Li; Yanling Zhang; Yanjiang Qiao
Journal:  J Oncol       Date:  2022-07-08       Impact factor: 4.501

2.  An Algorithm Framework for Drug-Induced Liver Injury Prediction Based on Genetic Algorithm and Ensemble Learning.

Authors:  Bowei Yan; Xiaona Ye; Jing Wang; Junshan Han; Lianlian Wu; Song He; Kunhong Liu; Xiaochen Bo
Journal:  Molecules       Date:  2022-05-12       Impact factor: 4.927

3.  Predicting acupuncture efficacy for functional dyspepsia based on routine clinical features: a machine learning study in the framework of predictive, preventive, and personalized medicine.

Authors:  Tao Yin; Hui Zheng; Tingting Ma; Xiaoping Tian; Jing Xu; Ying Li; Lei Lan; Mailan Liu; Ruirui Sun; Yong Tang; Fanrong Liang; Fang Zeng
Journal:  EPMA J       Date:  2022-02-02       Impact factor: 6.543

4.  A comparative mapping of plant species diversity using ensemble learning algorithms combined with high accuracy surface modeling.

Authors:  Yapeng Zhao; Xiaozhe Yin; Yan Fu; Tianxiang Yue
Journal:  Environ Sci Pollut Res Int       Date:  2021-10-21       Impact factor: 4.223

5.  Complex metabolic interactions between ovary, plasma, urine, and hair in ovarian cancer.

Authors:  Xiaocui Zhong; Rui Ran; Shanhu Gao; Manlin Shi; Xian Shi; Fei Long; Yanqiu Zhou; Yang Yang; Xianglan Tang; Anping Lin; Wuyang He; Tinghe Yu; Ting-Li Han
Journal:  Front Oncol       Date:  2022-08-02       Impact factor: 5.738

6.  Deqi Sensation to Predict Acupuncture Effect on Functional Dyspepsia: A Machine Learning Study.

Authors:  Li Chen; Tao Yin; Zhaoxuan He; Yuan Chen; Ruirui Sun; Jin Lu; Peihong Ma; Fang Zeng
Journal:  Evid Based Complement Alternat Med       Date:  2022-09-14       Impact factor: 2.650

7.  A Methylation Diagnostic Model Based on Random Forests and Neural Networks for Asthma Identification.

Authors:  Dong-Dong Li; Ting Chen; You-Liang Ling; YongAn Jiang; Qiu-Gen Li
Journal:  Comput Math Methods Med       Date:  2022-09-28       Impact factor: 2.809
