Comparing the Influence of Simulated Experimental Errors on 12 Machine Learning Algorithms in Bioactivity Modeling Using 12 Diverse Data Sets.

Isidro Cortes-Ciriano, Andreas Bender, Thérèse E Malliavin.

Abstract

To date, no systematic study has assessed the effect of random experimental errors on the predictive power of QSAR models. To address this gap, we have benchmarked the noise sensitivity of 12 learning algorithms on 12 data sets (15,840 models in total), namely the following: Support Vector Machines (SVM) with radial and polynomial (Poly) kernels, Gaussian Process (GP) with radial and polynomial kernels, Relevance Vector Machines (radial kernel), Random Forest (RF), Gradient Boosting Machines (GBM), Bagged Regression Trees, Partial Least Squares, and k-Nearest Neighbors. Model performance on the test set was used as a proxy to monitor the relative noise sensitivity of these algorithms as a function of the level of simulated noise added to the bioactivities from the training set. The noise was simulated by sampling from Gaussian distributions with increasingly larger variances, which ranged from zero to the range of pIC50 values spanned by a given data set. General trends were identified by designing a full-factorial experiment, which was analyzed with a normal linear model. Overall, GBM displayed low noise tolerance, although its performance was comparable to RF, SVM Radial, SVM Poly, GP Poly, and GP Radial at low noise levels. Of practical relevance, we show that the bag fraction parameter has a marked influence on the noise sensitivity of GBM, suggesting that low values (e.g., 0.1-0.2) should be used for this parameter when modeling noisy data. The remaining 11 algorithms display a comparable noise tolerance, as a smooth and linear degradation of model performance is observed with increasing noise level. However, SVM Poly and GP Poly display significant noise sensitivity at high noise levels in some cases. Overall, these results provide a practical guide for making informed decisions about which algorithm and parameter values to use according to the noise level present in the data.
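
The noise-simulation protocol described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the evenly spaced grid of variances, the synthetic one-descriptor data set, the toy k-nearest-neighbours regressor, and all function names are assumptions made for the example. Only the general idea comes from the abstract: Gaussian noise with variance between zero and the pIC50 range is added to the training bioactivities, and performance is then measured on the test set.

```python
import math
import random


def noise_variances(pic50, n_levels=5):
    """Evenly spaced variances from 0 up to the range (max - min) of pic50.

    The even spacing and the number of levels are assumptions of this sketch.
    """
    span = max(pic50) - min(pic50)
    return [span * i / (n_levels - 1) for i in range(n_levels)]


def add_gaussian_noise(pic50, variance, rng):
    """Return a noisy copy: each value gets an independent draw from N(0, variance)."""
    sd = math.sqrt(variance)
    return [y + rng.gauss(0.0, sd) for y in pic50]


def knn_predict(train_x, train_y, x, k=3):
    """Toy 1-D k-nearest-neighbours regressor: mean label of the k closest points."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:k]
    return sum(train_y[i] for i in nearest) / k


def rmse(y_true, y_pred):
    """Root-mean-square error between two equally long sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


rng = random.Random(0)

# Synthetic data (hypothetical): one descriptor x, noise-free pIC50 = 5 + 2x.
train_x = [i / 50 for i in range(50)]
train_y = [5.0 + 2.0 * x for x in train_x]
test_x = [0.01 + i / 20 for i in range(20)]
test_y = [5.0 + 2.0 * x for x in test_x]

for var in noise_variances(train_y):
    # Noise is added to the training labels only; test labels stay clean.
    noisy_y = add_gaussian_noise(train_y, var, rng)
    preds = [knn_predict(train_x, noisy_y, x) for x in test_x]
    print(f"variance={var:.2f}  test RMSE={rmse(test_y, preds):.3f}")
```

Keeping the test labels noise-free mirrors the abstract's use of test-set performance as the proxy for noise sensitivity. Swapping `knn_predict` for another learner would allow the same loop to compare algorithms.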

Year:  2015        PMID: 26038978     DOI: 10.1021/acs.jcim.5b00101

Source DB:  PubMed          Journal:  J Chem Inf Model        ISSN: 1549-9596            Impact factor:   4.956


  6 in total

1.  Advancing computer-aided drug discovery (CADD) by big data and data-driven machine learning modeling. (Review)

Authors:  Linlin Zhao; Heather L Ciallella; Lauren M Aleksunes; Hao Zhu
Journal:  Drug Discov Today       Date:  2020-07-11       Impact factor: 7.851

2.  Compilation and physicochemical classification analysis of a diverse hERG inhibition database.

Authors:  Remigijus Didziapetris; Kiril Lanevskij
Journal:  J Comput Aided Mol Des       Date:  2016-10-25       Impact factor: 3.686

3.  ASAS-NANP symposium: mathematical modeling in animal nutrition: limitations and potential next steps for modeling and modelers in the animal sciences. (Review)

Authors:  Marc Jacobs; Aline Remus; Charlotte Gaillard; Hector M Menendez; Luis O Tedeschi; Suresh Neethirajan; Jennifer L Ellis
Journal:  J Anim Sci       Date:  2022-06-01       Impact factor: 3.338

4.  Improved large-scale prediction of growth inhibition patterns using the NCI60 cancer cell line panel.

Authors:  Isidro Cortés-Ciriano; Gerard J P van Westen; Guillaume Bouvier; Michael Nilges; John P Overington; Andreas Bender; Thérèse E Malliavin
Journal:  Bioinformatics       Date:  2015-09-08       Impact factor: 6.937

5.  Experimental Errors in QSAR Modeling Sets: What We Can Do and What We Cannot Do.

Authors:  Linlin Zhao; Wenyi Wang; Alexander Sedykh; Hao Zhu
Journal:  ACS Omega       Date:  2017-06-19

6.  QSAR-derived affinity fingerprints (part 2): modeling performance for potency prediction.

Authors:  Isidro Cortés-Ciriano; Ctibor Škuta; Andreas Bender; Daniel Svozil
Journal:  J Cheminform       Date:  2020-06-05       Impact factor: 5.514
