Coping with unbalanced class data sets in oral absorption models.

Danielle Newby, Alex A Freitas, Taravat Ghafourian.

Abstract

Class imbalance occurs frequently in drug discovery data sets. In oral absorption data sets from the literature, highly absorbed compounds considerably outnumber poorly absorbed ones. Models built on such data are biased toward highly absorbed compounds and generalize poorly to industry settings, where more early-stage drug candidates are poorly absorbed. This paper presents two strategies for coping with unbalanced class data sets, both using classification decision trees: undersampling the majority (high absorption) class and applying misclassification costs. The data set published by Hou et al. [J. Chem. Inf. Model. 2007, 47, 208-218], which contains percentage human intestinal absorption for 645 drug and drug-like compounds, was used to develop and validate classification trees by classification and regression tree (C&RT) analysis. The results indicate that undersampling the majority class of highly absorbed compounds to give a balanced (50:50) training set achieves better accuracy for poorly absorbed compounds, whereas the biased training set achieves higher accuracy for highly absorbed compounds. Applying misclassification costs to reduce false positives or false negatives improved class predictions. Moreover, the classical overall accuracy measure used in many publications was shown to be particularly misleading for unbalanced data sets; the more appropriate measures presented here may give a more realistic assessment of classification model performance. These strategies thus help cope with unbalanced class data sets and yield classification models applicable in industry.
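The ideas in the abstract can be illustrated with a minimal sketch, which is not the authors' C&RT workflow. It shows random undersampling to a 50:50 training set, per-class measures next to overall accuracy, and a cost-weighted error count. Every name here (`undersample`, `class_metrics`, `total_cost`), the toy labels, the choice of "high" as the positive class, and the cost values are assumptions made for illustration. The abstract does not name its alternative measures, so sensitivity and specificity stand in as common choices. The paper applies misclassification costs when building the trees, whereas this sketch only scores finished predictions under a cost scheme.

```python
import random
from collections import Counter


def undersample(rows, label_of, seed=0):
    """Randomly drop rows so every class has as many rows as the smallest class."""
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    n_min = min(len(members) for members in by_class.values())
    rng = random.Random(seed)
    balanced = []
    for members in by_class.values():
        balanced.extend(rng.sample(members, n_min))
    rng.shuffle(balanced)
    return balanced


def confusion_counts(y_true, y_pred, positive):
    """Return (tp, fn, fp, tn) for a two-class problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fn, fp, tn


def class_metrics(y_true, y_pred, positive="high"):
    """Overall accuracy plus per-class rates (assumed stand-ins for the paper's measures)."""
    tp, fn, fp, tn = confusion_counts(y_true, y_pred, positive)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }


def total_cost(y_true, y_pred, cost_fp, cost_fn, positive="high"):
    """Cost-weighted error count; the cost values are chosen by the user."""
    _, fn, fp, _ = confusion_counts(y_true, y_pred, positive)
    return fp * cost_fp + fn * cost_fn


# Hypothetical toy data: 90 highly absorbed and 10 poorly absorbed compounds.
y_true = ["high"] * 90 + ["poor"] * 10
# A trivial model that predicts "high" for everything.
y_pred = ["high"] * 100

metrics = class_metrics(y_true, y_pred)
# Accuracy is 0.9, yet specificity is 0.0: no poorly absorbed compound is recognized.
print(metrics)

# Penalizing false positives (poor predicted as high) five times as heavily as false negatives.
print(total_cost(y_true, y_pred, cost_fp=5, cost_fn=1))

# Undersampling the toy set leaves 10 compounds per class.
balanced = undersample(list(zip(range(100), y_true)), label_of=lambda row: row[1])
print(Counter(label for _, label in balanced))
```

On this toy set the trivial model reaches 90% overall accuracy while misclassifying every poorly absorbed compound, which is the kind of misleading result the abstract warns about.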

Year:  2013        PMID: 23293925     DOI: 10.1021/ci300348u

Source DB:  PubMed          Journal:  J Chem Inf Model        ISSN: 1549-9596            Impact factor:   4.956


  8 in total

1.  Exploring different strategies for imbalanced ADME data problem: case study on Caco-2 permeability modeling.

Authors:  Hai Pham-The; Gerardo Casañola-Martin; Teresa Garrigues; Marival Bermejo; Isabel González-Álvarez; Nam Nguyen-Hai; Miguel Ángel Cabrera-Pérez; Huong Le-Thi-Thu
Journal:  Mol Divers       Date:  2015-12-07       Impact factor: 2.943

2.  Estimation of biliary excretion of foreign compounds using properties of molecular structure.

Authors:  Mohsen Sharifi; Taravat Ghafourian
Journal:  AAPS J       Date:  2013-11-08       Impact factor: 4.009

3.  Undersampling: case studies of flaviviral inhibitory activities.

Authors:  Stephen J Barigye; José Manuel García de la Vega; Juan A Castillo-Garit
Journal:  J Comput Aided Mol Des       Date:  2019-11-26       Impact factor: 3.686

4.  Machine learning for predicting lifespan-extending chemical compounds.

Authors:  Diogo G Barardo; Danielle Newby; Daniel Thornton; Taravat Ghafourian; João Pedro de Magalhães; Alex A Freitas
Journal:  Aging (Albany NY)       Date:  2017-07-18       Impact factor: 5.682

5.  Towards a generalized toxicity prediction model for oxide nanomaterials using integrated data from different sources.

Authors:  Jang-Sik Choi; My Kieu Ha; Tung Xuan Trinh; Tae Hyun Yoon; Hyung-Gi Byun
Journal:  Sci Rep       Date:  2018-04-17       Impact factor: 4.379

6.  An Introduction to Machine Learning.

Authors:  Solveig Badillo; Balazs Banfai; Fabian Birzele; Iakov I Davydov; Lucy Hutchinson; Tony Kam-Thong; Juliane Siebourg-Polster; Bernhard Steiert; Jitao David Zhang
Journal:  Clin Pharmacol Ther       Date:  2020-03-03       Impact factor: 6.875

7.  A novel adaptive ensemble classification framework for ADME prediction.

Authors:  Ming Yang; Jialei Chen; Liwen Xu; Xiufeng Shi; Xin Zhou; Zhijun Xi; Rui An; Xinhong Wang
Journal:  RSC Adv       Date:  2018-03-26       Impact factor: 4.036

8.  QSAR modeling of imbalanced high-throughput screening data in PubChem.

Authors:  Alexey V Zakharov; Megan L Peach; Markus Sitzmann; Marc C Nicklaus
Journal:  J Chem Inf Model       Date:  2014-02-28       Impact factor: 4.956
