An efficient algorithm coupled with synthetic minority over-sampling technique to classify imbalanced PubChem BioAssay data.


Ming Hao, Yanli Wang, Stephen H Bryant.

Abstract

High-throughput screening (HTS) often generates imbalanced datasets. When the imbalance is not taken into account, most classification methods tend to produce high predictive accuracy for the majority class but markedly poorer performance for the minority class. In this work, an efficient algorithm, GLMBoost, coupled with the Synthetic Minority Over-sampling TEchnique (SMOTE), is developed and used to overcome this problem for several imbalanced datasets from PubChem BioAssay. With the proposed combined method, the rare samples (active compounds), for which results are usually poor, can be detected with high balanced accuracy (Gmean). For comparison with GLMBoost, Random Forest (RF) combined with SMOTE is also used to classify the same datasets. Our results show that the former (GLMBoost+SMOTE) not only performs better, as measured by the percentage of correctly classified rare samples (Sensitivity) and by Gmean, but is also more computationally efficient than the latter (RF+SMOTE). We therefore hope that the proposed combined algorithm based on GLMBoost and SMOTE can be widely used to tackle the imbalanced classification problem. Published by Elsevier B.V.
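The two ideas the abstract relies on, over-sampling the rare class and scoring with Sensitivity and Gmean, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `smote` and `sensitivity_specificity_gmean`, the neighbor count `k`, and the toy data are assumptions made for illustration, and Gmean is assumed here to be the usual geometric mean of Sensitivity and Specificity. The sketch interpolates between a minority sample and one of its nearest minority neighbors, which is the core step of SMOTE, and leaves out the GLMBoost and RF classifiers entirely.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points interpolated between minority samples.

    Assumption: plain Euclidean distance on numeric feature vectors.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the base point, excluding itself
        others = [p for p in minority if p is not base]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


def sensitivity_specificity_gmean(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity, gmean) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity, math.sqrt(sensitivity * specificity)


# Toy data (hypothetical): three "active" compounds with two descriptors each.
actives = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_actives = smote(actives, n_new=4, k=2)

# Two actives among ten compounds; the classifier finds only one of them.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
sens, spec, gmean = sensitivity_specificity_gmean(y_true, y_pred)
```

In the toy labels, nine of ten predictions are correct, yet Sensitivity is only 0.5 and Gmean is about 0.71, which is why plain accuracy can hide poor performance on the rare class.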


Keywords: High-throughput screening; Imbalanced classification; Over-sampling; PubChem; Under-sampling


Year: 2013 PMID: 24331047 PMCID: PMC3884825 DOI: 10.1016/j.aca.2013.10.050

Source DB: PubMed Journal: Anal Chim Acta ISSN: 0003-2670 Impact factor: 6.558
