
An efficient algorithm coupled with synthetic minority over-sampling technique to classify imbalanced PubChem BioAssay data.

Ming Hao, Yanli Wang, Stephen H Bryant.

Abstract

High-throughput screening (HTS) often generates imbalanced datasets. When the imbalance is not taken into account, most classification methods achieve high predictive accuracy for the majority class but markedly poor performance for the minority class. In this work, an efficient algorithm, GLMBoost, coupled with the Synthetic Minority Over-sampling TEchnique (SMOTE), is developed and applied to several imbalanced datasets from PubChem BioAssay to overcome this problem. With the proposed combined method, the rare samples (active compounds), for which results are usually poor, can be detected with high balanced accuracy (Gmean). For comparison, Random Forest (RF) combined with SMOTE is also used to classify the same datasets. Our results show that GLMBoost+SMOTE not only performs better than RF+SMOTE, as measured by the percentage of rare samples correctly classified (Sensitivity) and by Gmean, but is also more computationally efficient. We therefore hope that the proposed combination of GLMBoost and SMOTE can be widely used to tackle the imbalanced classification problem. Published by Elsevier B.V.
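The two ideas the abstract relies on, over-sampling the minority class and scoring with Sensitivity and Gmean rather than plain accuracy, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `smote` and `sensitivity_specificity_gmean` are hypothetical, the over-sampler is only the basic SMOTE interpolation step (a synthetic point placed on the segment between a minority sample and one of its nearest minority neighbours), and Gmean is assumed to follow the usual definition, the square root of Sensitivity times Specificity.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Basic SMOTE-style over-sampling (illustrative only).

    Each synthetic point is drawn on the line segment between a randomly
    chosen minority sample and one of its k nearest minority neighbours.
    Requires at least two minority samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority samples by Euclidean distance to `base`.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: math.dist(base, p))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # relative position between base and neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def sensitivity_specificity_gmean(y_true, y_pred):
    """Return (Sensitivity, Specificity, Gmean) for labels 1 = active, 0 = inactive.

    Assumes Gmean = sqrt(Sensitivity * Specificity) and that both classes
    occur in y_true.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return sens, spec, math.sqrt(sens * spec)


# Toy data: 5 actives among 100 compounds. A classifier that always predicts
# "inactive" is 95% accurate, yet its Sensitivity and Gmean are both zero.
y_true = [1] * 5 + [0] * 95
always_inactive = [0] * 100
accuracy = sum(t == p for t, p in zip(y_true, always_inactive)) / len(y_true)
sens, spec, gmean = sensitivity_specificity_gmean(y_true, always_inactive)

# Over-sample a toy two-descriptor minority class with 10 synthetic points.
actives = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
extra_actives = smote(actives, n_new=10, k=3)
```

In this toy case plain accuracy rewards ignoring the active compounds entirely, which is why the abstract reports Sensitivity and Gmean. In practice the synthetic samples would be added to the training set only, never to the data used for evaluation.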


Keywords:  High-throughput screening; Imbalanced classification; Over-sampling; PubChem; Under-sampling


Year:  2013        PMID: 24331047      PMCID: PMC3884825          DOI: 10.1016/j.aca.2013.10.050

Source DB:  PubMed          Journal:  Anal Chim Acta        ISSN: 0003-2670            Impact factor:   6.558


