
Improved Chemical Structure-Activity Modeling Through Data Augmentation.

Isidro Cortés-Ciriano, Andreas Bender.

Abstract

Extending the original training data with simulated, unobserved data points has proven powerful for increasing both the generalization ability of predictive models and their robustness against changes in the structure of the data (e.g., systematic drifts in the response variable) in areas as diverse as the analysis of spectroscopic data and the detection of conserved domains in protein sequences. In this contribution, we explore the effect of data augmentation on the predictive power of QSAR models, quantified by the RMSE values on the test set. We collected 8 diverse data sets from the literature and from ChEMBL version 19 reporting compound activity as pIC50 values. The original training data were replicated (i.e., augmented) N times (N ∈ {0, 1, 2, 4, 6, 8, 10}), and these replicates were perturbed with Gaussian noise (μ = 0, σ = σnoise) applied to (i) the pIC50 values, (ii) the compound descriptors, (iii) both the compound descriptors and the pIC50 values, or (iv) neither. The effect of data augmentation was evaluated across three algorithms (RF, GBM, and SVM with a radial kernel) and two descriptor types (Morgan fingerprints and physicochemical-property-based descriptors). The influence of all factor levels was analyzed with a balanced fixed-effect full-factorial experiment. Overall, data augmentation consistently increased predictive power on the test set by 10-15%. Injecting noise into (i) the compound descriptors or (ii) both the compound descriptors and the pIC50 values led to the largest drop in RMSEtest values (from 0.67-0.72 to 0.60-0.63 pIC50 units). The maximum increase in predictive power provided by data augmentation was reached when the training data were replicated once. Therefore, extending the original training data with a single perturbed replicate represents a reasonable trade-off between the increased performance of the models and the computational cost of data augmentation, namely the increase in (i) model complexity, due to the need to optimize σnoise, and (ii) the number of training examples.
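The replicate-and-perturb scheme described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name, the `perturb` keyword values, and the default σnoise are assumptions, and in practice σnoise would be tuned as a hyperparameter as the abstract notes.

```python
import numpy as np

def augment(X, y, n_reps=1, sigma_noise=0.1, perturb="both", seed=None):
    """Replicate (X, y) n_reps times and perturb the copies with
    Gaussian noise N(0, sigma_noise) on the descriptors, the response,
    both, or neither, as in the four augmentation settings described above."""
    rng = np.random.default_rng(seed)
    X_parts, y_parts = [X], [y]          # keep the original data unperturbed
    for _ in range(n_reps):
        Xp = X.copy()
        yp = y.copy()
        if perturb in ("descriptors", "both"):
            Xp = Xp + rng.normal(0.0, sigma_noise, size=X.shape)
        if perturb in ("response", "both"):
            yp = yp + rng.normal(0.0, sigma_noise, size=y.shape)
        X_parts.append(Xp)
        y_parts.append(yp)
    return np.vstack(X_parts), np.concatenate(y_parts)

# With n_reps=1 (the trade-off the abstract recommends), the augmented
# training set is exactly twice the size of the original.
X = np.random.rand(50, 8)   # 50 compounds, 8 descriptors
y = np.random.rand(50)      # pIC50 values
X_aug, y_aug = augment(X, y, n_reps=1, sigma_noise=0.1, perturb="both", seed=0)
```

The augmented arrays would then be passed to any standard regressor (RF, GBM, SVM) in place of the original training set.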


Year:  2015        PMID: 26619900     DOI: 10.1021/acs.jcim.5b00570

Source DB:  PubMed          Journal:  J Chem Inf Model        ISSN: 1549-9596            Impact factor:   4.956


  10 in total

Review 1.  Big-Data Science in Porous Materials: Materials Genomics and Machine Learning.

Authors:  Kevin Maik Jablonka; Daniele Ongari; Seyed Mohamad Moosavi; Berend Smit
Journal:  Chem Rev       Date:  2020-06-10       Impact factor: 60.622

2.  Identifying diverse metal oxide nanomaterials with lethal effects on embryonic zebrafish using machine learning.

Authors:  Richard Liam Marchese Robinson; Haralambos Sarimveis; Philip Doganis; Xiaodong Jia; Marianna Kotzabasaki; Christiana Gousiadou; Stacey Lynn Harper; Terry Wilkins
Journal:  Beilstein J Nanotechnol       Date:  2021-11-29       Impact factor: 3.649

3.  Quantitative Structure-activity Relationship (QSAR) Models for Docking Score Correction.

Authors:  Yoshifumi Fukunishi; Satoshi Yamasaki; Isao Yasumatsu; Koh Takeuchi; Takashi Kurosawa; Haruki Nakamura
Journal:  Mol Inform       Date:  2016-04-29       Impact factor: 3.353

4.  QSAR-derived affinity fingerprints (part 2): modeling performance for potency prediction.

Authors:  Isidro Cortés-Ciriano; Ctibor Škuta; Andreas Bender; Daniel Svozil
Journal:  J Cheminform       Date:  2020-06-05       Impact factor: 5.514

5.  Inductive transfer learning for molecular activity prediction: Next-Gen QSAR Models with MolPMoFiT.

Authors:  Xinhao Li; Denis Fourches
Journal:  J Cheminform       Date:  2020-04-22       Impact factor: 5.514

6.  Augmentation in Healthcare: Augmented Biosignal Using Deep Learning and Tensor Representation.

Authors:  Marwa Ibrahim; Mohammad Wedyan; Ryan Alturki; Muazzam A Khan; Adel Al-Jumaily
Journal:  J Healthc Eng       Date:  2021-01-27       Impact factor: 2.682

7.  GPCR_LigandClassify.py; a rigorous machine learning classifier for GPCR targeting compounds.

Authors:  Marawan Ahmed; Horia Jalily Hasani; Subha Kalyaanamoorthy; Khaled Barakat
Journal:  Sci Rep       Date:  2021-05-04       Impact factor: 4.379

8.  Surrogate- and invariance-boosted contrastive learning for data-scarce applications in science.

Authors:  Charlotte Loh; Thomas Christensen; Rumen Dangovski; Samuel Kim; Marin Soljačić
Journal:  Nat Commun       Date:  2022-07-21       Impact factor: 17.694

9.  Comparison of Quantitative and Qualitative (Q)SAR Models Created for the Prediction of Ki and IC50 Values of Antitarget Inhibitors.

Authors:  Alexey A Lagunin; Maria A Romanova; Anton D Zadorozhny; Natalia S Kurilenko; Boris V Shilov; Pavel V Pogodin; Sergey M Ivanov; Dmitry A Filimonov; Vladimir V Poroikov
Journal:  Front Pharmacol       Date:  2018-10-10       Impact factor: 5.810

10.  Network-based piecewise linear regression for QSAR modelling.

Authors:  Jonathan Cardoso-Silva; Lazaros G Papageorgiou; Sophia Tsoka
Journal:  J Comput Aided Mol Des       Date:  2019-10-18       Impact factor: 3.686

