
GHOST: Adjusting the Decision Threshold to Handle Imbalanced Data in Machine Learning.

Carmen Esposito [1], Gregory A Landrum [1,2], Nadine Schneider [3], Nikolaus Stiefl [3], Sereina Riniker [1].

Abstract

Machine learning classifiers trained on class-imbalanced data are prone to overpredict the majority class. This leads to a larger misclassification rate for the minority class, which in many real-world applications is the class of interest. For binary classification, the decision threshold defaults to 0.5, which is often not ideal for imbalanced data. Adjusting the decision threshold is a good strategy to deal with the class imbalance problem. In this work, we present two different automated procedures for selecting the optimal decision threshold for imbalanced classification. A major advantage of our procedures is that they require neither retraining of the machine learning models nor resampling of the training data. The first approach is specific to random forest (RF), while the second approach, named GHOST, can in principle be applied to any machine learning classifier. We tested these procedures on 138 public drug discovery data sets containing structure-activity data for a variety of pharmaceutical targets. We show that both thresholding methods significantly improve the performance of RF. We tested GHOST with four different classifiers in combination with two molecular descriptors, and found that most classifiers benefit from threshold optimization. GHOST also outperformed other strategies, including random undersampling and conformal prediction. Finally, we show that our thresholding procedures can be effectively applied to real-world drug discovery projects, where the imbalance and characteristics of the data vary greatly between the training and test sets.
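The RF-specific idea in the abstract can be illustrated with a short sketch: tune the decision threshold on a random forest's out-of-bag (OOB) predictions, so no retraining or separate validation set is needed. This is only an illustrative approximation of the paper's approach, not the authors' implementation; the toy data, the candidate-threshold grid, and the use of Cohen's kappa as the selection metric are assumptions for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import cohen_kappa_score

# Imbalanced toy data set (~10% minority class).
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)

# Random forest with OOB scoring enabled, so each training sample
# gets a prediction from trees that did not see it during fitting.
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X, y)

# Out-of-bag probability of the minority (positive) class.
oob_proba = rf.oob_decision_function_[:, 1]

# Scan candidate thresholds and keep the one maximizing Cohen's kappa,
# instead of using the default threshold of 0.5.
thresholds = np.arange(0.05, 0.55, 0.05)
kappas = [cohen_kappa_score(y, (oob_proba >= t).astype(int))
          for t in thresholds]
best_threshold = thresholds[int(np.argmax(kappas))]

print(f"optimized decision threshold: {best_threshold:.2f}")
```

At prediction time, one would simply compare `predict_proba` outputs for new samples against `best_threshold` rather than 0.5; the trained model itself is unchanged.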

Year:  2021        PMID: 34100609     DOI: 10.1021/acs.jcim.1c00160

Source DB:  PubMed          Journal:  J Chem Inf Model        ISSN: 1549-9596            Impact factor:   4.956


  7 in total

1.  Comparing classification models-a practical tutorial.

Authors:  W Patrick Walters
Journal:  J Comput Aided Mol Des       Date:  2021-09-22       Impact factor: 4.179

2.  Protposer: The web server that readily proposes protein stabilizing mutations with high PPV.

Authors:  Helena García-Cebollada; Alfonso López; Javier Sancho
Journal:  Comput Struct Biotechnol J       Date:  2022-05-10       Impact factor: 6.155

3.  Analysis of the benefits of imputation models over traditional QSAR models for toxicity prediction.

Authors:  Moritz Walter; Luke N Allen; Antonio de la Vega de León; Samuel J Webb; Valerie J Gillet
Journal:  J Cheminform       Date:  2022-06-07       Impact factor: 8.489

4.  Deep learning model calibration for improving performance in class-imbalanced medical image classification tasks.

Authors:  Sivaramakrishnan Rajaraman; Prasanth Ganesan; Sameer Antani
Journal:  PLoS One       Date:  2022-01-27       Impact factor: 3.240

5.  Active Learning Configuration Interaction for Excited-State Calculations of Polycyclic Aromatic Hydrocarbons.

Authors:  WooSeok Jeong; Carlo Alberto Gaggioli; Laura Gagliardi
Journal:  J Chem Theory Comput       Date:  2021-11-17       Impact factor: 6.006

6.  Classification of facial paralysis based on machine learning techniques.

Authors:  Amira Gaber; Mona F Taher; Manal Abdel Wahed; Nevin Mohieldin Shalaby; Sarah Gaber
Journal:  Biomed Eng Online       Date:  2022-09-07       Impact factor: 3.903

7.  Augmented Hill-Climb increases reinforcement learning efficiency for language-based de novo molecule generation.

Authors:  Morgan Thomas; Noel M O'Boyle; Andreas Bender; Chris de Graaf
Journal:  J Cheminform       Date:  2022-10-03       Impact factor: 8.489


Beijing Coyote Bioscience Co., Ltd. © 2022-2023.