Effect of Dataset Size and Train/Test Split Ratios in QSAR/QSPR Multiclass Classification.

Anita Rácz, Dávid Bajusz, Károly Héberger.

Abstract

Datasets in typical quantitative structure-activity/property relationship (QSAR/QSPR) classification studies range from a few hundred to thousands of samples. However, dataset size and the train/test split ratio can greatly affect model outcomes, and thus the classification performance itself. We compared several combinations of dataset sizes and split ratios with five different machine learning algorithms to identify differences and similarities and to select the best parameter settings in nonbinary (multiclass) classification. Models are also known to be ranked differently depending on the performance merit(s) used. Here, 25 performance parameters were calculated for each model, and factorial ANOVA was then applied to compare the results. The results clearly show differences not only between the applied machine learning algorithms but also between the dataset sizes and, to a lesser extent, the train/test split ratios. The XGBoost algorithm outperformed the others, even in multiclass modeling. The performance parameters reacted differently to changes in sample set size; some were much more sensitive to this factor than others. Moreover, significant differences were detected between train/test split ratios as well, which exert a great effect on the test validation of the models.
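
The kind of experiment summarized in the abstract, a grid over dataset sizes and train/test split ratios scored with more than one performance parameter, can be sketched roughly as follows. This is a minimal illustration and not the authors' pipeline: the synthetic three-class data, the nearest-centroid classifier, the chosen sizes and ratios, and all function names are assumptions made for the example. Only two performance parameters (accuracy and balanced accuracy) are computed, and no ANOVA step is included.

```python
import random
from collections import defaultdict

N_CLASSES = 3
# Hypothetical imbalanced class pattern (3:2:1); not taken from the paper.
LABEL_PATTERN = (0, 0, 0, 1, 1, 2)


def make_dataset(n_samples, seed=0):
    """Synthetic 2-D points: class k is a Gaussian blob centred at (2k, 2k)."""
    rng = random.Random(seed)
    data = []
    for i in range(n_samples):
        label = LABEL_PATTERN[i % len(LABEL_PATTERN)]
        x = (rng.gauss(2.0 * label, 1.0), rng.gauss(2.0 * label, 1.0))
        data.append((x, label))
    return data


def stratified_split(data, train_fraction, seed=0):
    """Split each class separately so both parts keep the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, label in data:
        by_class[label].append((x, label))
    train, test = [], []
    for label in sorted(by_class):
        items = by_class[label]
        rng.shuffle(items)
        cut = int(round(train_fraction * len(items)))
        train.extend(items[:cut])
        test.extend(items[cut:])
    return train, test


def fit_centroids(train):
    """Stand-in classifier: store the mean point of every class."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for (x0, x1), label in train:
        s = sums[label]
        s[0] += x0
        s[1] += x1
        s[2] += 1
    return {label: (s[0] / s[2], s[1] / s[2]) for label, s in sums.items()}


def predict(centroids, x):
    """Assign the class whose centroid is closest (squared Euclidean distance)."""
    return min(
        centroids,
        key=lambda c: (x[0] - centroids[c][0]) ** 2 + (x[1] - centroids[c][1]) ** 2,
    )


def evaluate(centroids, test):
    """Return (accuracy, balanced accuracy) on the test part."""
    hits = [0] * N_CLASSES
    totals = [0] * N_CLASSES
    for x, label in test:
        totals[label] += 1
        if predict(centroids, x) == label:
            hits[label] += 1
    accuracy = sum(hits) / len(test)
    recalls = [h / t for h, t in zip(hits, totals) if t > 0]
    balanced_accuracy = sum(recalls) / len(recalls)
    return accuracy, balanced_accuracy


def run_grid(sizes=(90, 300, 900), train_fractions=(0.5, 0.6, 0.7, 0.8)):
    """Evaluate every (dataset size, train fraction) combination; values are arbitrary."""
    results = {}
    for n in sizes:
        data = make_dataset(n)
        for frac in train_fractions:
            train, test = stratified_split(data, frac)
            results[(n, frac)] = evaluate(fit_centroids(train), test)
    return results


if __name__ == "__main__":
    for (n, frac), (acc, bacc) in sorted(run_grid().items()):
        print(f"n={n:4d} train={frac:.1f} accuracy={acc:.3f} balanced_accuracy={bacc:.3f}")
```

In a fuller study, each grid cell would be repeated over several random seeds and classifiers, and the resulting table of performance parameters would then be analyzed with a factorial design, with dataset size, split ratio, and algorithm as factors.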

Keywords: XGBoost; imbalanced; machine learning; multiclass classification; training/test split ratio; validation

Year: 2021 PMID: 33669834 PMCID: PMC7922354 DOI: 10.3390/molecules26041111

Source DB: PubMed Journal: Molecules ISSN: 1420-3049 Impact factor: 4.411
