Oversampling to overcome overfitting: exploring the relationship between data set composition, molecular descriptors, and predictive modeling methods.

Chia-Yun Chang, Ming-Tsung Hsu, Emilio Xavier Esposito, Yufeng J Tseng.

Abstract

The traditional biological assay is very time-consuming, and thus the ability to quickly screen large numbers of compounds against a specific biological target is appealing. To speed up the biological evaluation of compounds, high-throughput screening is widely used in the fields of biomedicine, biological information, and drug discovery. The research presented in this study focuses on the use of support vector machines (a machine learning method), various classes of molecular descriptors, and different sampling techniques to overcome overfitting when classifying compounds for cytotoxicity with respect to the Jurkat cell line. The cell cytotoxicity data set is imbalanced (a few active compounds and very many inactive compounds), and the performance of predictive modeling methods is adversely affected in these situations. Models built from imbalanced data sets are commonly overfit with respect to the dominant classified end point; in this study the models routinely overfit toward inactive (noncytotoxic) compounds when the imbalance was substantial. Support vector machine (SVM) models were used to probe the proficiency of different classes of molecular descriptors and oversampling ratios. The SVM models were constructed from 4D-FPs, MOE (1D, 2D, and 2½D), noNP+MOE, and CATS2D trial descriptor pools and compared to the predictive abilities of CATS2D-based random forest models. Compared to previous results in the literature, the SVM models built from oversampled data sets exhibited better predictive abilities for the training and external test sets.
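
The abstract refers to oversampling ratios without describing the procedure. As a rough illustration only, the sketch below shows one common form of oversampling: randomly duplicating minority-class (active) samples until a chosen active-to-inactive ratio is reached. The function name `oversample_minority`, the `target_ratio` parameter, and the toy labels are assumptions made for this example; the authors' actual sampling scheme may differ.

```python
import math
import random
from collections import Counter


def oversample_minority(samples, labels, minority_label, target_ratio, seed=0):
    """Randomly duplicate minority samples until
    minority_count / majority_count >= target_ratio.

    Hypothetical helper for illustration; not the procedure from the paper.
    """
    rng = random.Random(seed)
    minority = [s for s, y in zip(samples, labels) if y == minority_label]
    if not minority:
        raise ValueError("no minority samples to duplicate")
    majority_count = sum(1 for y in labels if y != minority_label)

    # Number of extra minority copies needed to reach the requested ratio.
    needed = max(0, math.ceil(target_ratio * majority_count) - len(minority))
    extra = [rng.choice(minority) for _ in range(needed)]

    # Originals are kept; duplicates are appended at the end.
    return list(samples) + extra, list(labels) + [minority_label] * needed


# Toy data: 2 active vs 8 inactive compounds (indices stand in for descriptors).
compounds = list(range(10))
activity = ["active"] * 2 + ["inactive"] * 8

balanced_x, balanced_y = oversample_minority(compounds, activity, "active", 1.0)
print(Counter(balanced_y))
```

In practice the duplication would normally be applied to the training set only, so that the external test set still reflects the original class imbalance.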

Year:  2013        PMID: 23464929     DOI: 10.1021/ci4000536

Source DB:  PubMed          Journal:  J Chem Inf Model        ISSN: 1549-9596            Impact factor:   4.956


  12 in total

1.  Discovery of Influenza A virus neuraminidase inhibitors using support vector machine and Naïve Bayesian models.

Authors:  Wenwen Lian; Jiansong Fang; Chao Li; Xiaocong Pang; Ai-Lin Liu; Guan-Hua Du
Journal:  Mol Divers       Date:  2015-12-21       Impact factor: 2.943

2.  An efficient algorithm coupled with synthetic minority over-sampling technique to classify imbalanced PubChem BioAssay data.

Authors:  Ming Hao; Yanli Wang; Stephen H Bryant
Journal:  Anal Chim Acta       Date:  2013-11-06       Impact factor: 6.558

3.  Naïve Bayesian Models for Vero Cell Cytotoxicity.

Authors:  Alexander L Perryman; Jimmy S Patel; Riccardo Russo; Eric Singleton; Nancy Connell; Sean Ekins; Joel S Freundlich
Journal:  Pharm Res       Date:  2018-06-29       Impact factor: 4.200

4.  Modelling compound cytotoxicity using conformal prediction and PubChem HTS data.

Authors:  Fredrik Svensson; Ulf Norinder; Andreas Bender
Journal:  Toxicol Res (Camb)       Date:  2016-10-31       Impact factor: 3.524

Review 5.  Getting the most out of PubChem for virtual screening.

Authors:  Sunghwan Kim
Journal:  Expert Opin Drug Discov       Date:  2016-08-05       Impact factor: 6.098

Review 6.  Modern approaches to accelerate discovery of new antischistosomal drugs.

Authors:  Bruno Junior Neves; Eugene Muratov; Renato Beilner Machado; Carolina Horta Andrade; Pedro Vitor Lemos Cravo
Journal:  Expert Opin Drug Discov       Date:  2016-05-03       Impact factor: 6.098

7.  DeepSnap-Deep Learning Approach Predicts Progesterone Receptor Antagonist Activity With High Performance.

Authors:  Yasunari Matsuzaka; Yoshihiro Uesawa
Journal:  Front Bioeng Biotechnol       Date:  2020-01-22

8.  GPCR_LigandClassify.py; a rigorous machine learning classifier for GPCR targeting compounds.

Authors:  Marawan Ahmed; Horia Jalily Hasani; Subha Kalyaanamoorthy; Khaled Barakat
Journal:  Sci Rep       Date:  2021-05-04       Impact factor: 4.379

9.  Predicting reference soil groups using legacy data: A data pruning and Random Forest approach for tropical environment (Dano catchment, Burkina Faso).

Authors:  Kpade O L Hounkpatin; Karsten Schmidt; Felix Stumpf; Gerald Forkuor; Thorsten Behrens; Thomas Scholten; Wulf Amelung; Gerhard Welp
Journal:  Sci Rep       Date:  2018-07-02       Impact factor: 4.379

10.  Testing Novel Portland Cement Formulations with Carbon Nanotubes and Intrinsic Properties Revelation: Nanoindentation Analysis with Machine Learning on Microstructure Identification.

Authors:  Georgios Konstantopoulos; Elias P Koumoulos; Costas A Charitidis
Journal:  Nanomaterials (Basel)       Date:  2020-03-30       Impact factor: 5.076
