Effect of Dataset Size and Train/Test Split Ratios in QSAR/QSPR Multiclass Classification.

Anita Rácz, Dávid Bajusz, Károly Héberger.

Abstract

Datasets used in typical quantitative structure-activity/property relationship (QSAR/QSPR) modeling and classification range from a few hundred to thousands of samples. However, the dataset size and the train/test split ratio can greatly affect the resulting models, and thus the classification performance itself. We compared several combinations of dataset sizes and split ratios with five different machine learning algorithms to identify differences and similarities and to select the best parameter settings in nonbinary (multiclass) classification. Models are also known to rank differently depending on the performance merit(s) used. Here, 25 performance parameters were calculated for each model, and factorial ANOVA was then applied to compare the results. The results clearly show differences not only between the applied machine learning algorithms but also between the dataset sizes and, to a lesser extent, the train/test split ratios. The XGBoost algorithm outperformed the others, even in multiclass modeling. The performance parameters reacted differently to changes in sample set size; some were much more sensitive to this factor than others. Moreover, significant differences were detected between train/test split ratios as well, and these had a great effect on the test validation of the models.
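The kind of comparison the abstract describes can be illustrated with a minimal, standard-library sketch. It sweeps a few dataset sizes and test fractions on a synthetic, imbalanced three-class label set and reports three performance measures. Everything here is an assumption made for illustration: the class names, the sample sizes, the split ratios, the helper names `stratified_split` and `multiclass_metrics`, and the majority-class baseline that stands in for a real classifier. None of it reproduces the authors' data, algorithms, or their 25 performance parameters.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction, seed=0):
    """Split indices into train/test while roughly preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for y in sorted(by_class):
        idx = by_class[y][:]
        rng.shuffle(idx)
        # At least one test sample per class, but never the whole class.
        n_test = min(max(1, round(len(idx) * test_fraction)), len(idx) - 1)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


def multiclass_metrics(y_true, y_pred):
    """Accuracy, balanced accuracy (mean per-class recall), and macro-averaged F1."""
    recalls, f1s = [], []
    for c in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        recall = tp / (tp + fn)
        precision = tp / (tp + fp) if tp + fp else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        recalls.append(recall)
        f1s.append(f1)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return {
        "accuracy": accuracy,
        "balanced_accuracy": sum(recalls) / len(recalls),
        "macro_f1": sum(f1s) / len(f1s),
    }


if __name__ == "__main__":
    rng = random.Random(42)
    for n_samples in (100, 500, 1000):          # illustrative dataset sizes
        labels = rng.choices(["low", "medium", "high"], weights=[6, 3, 1], k=n_samples)
        for test_fraction in (0.2, 0.3, 0.5):   # illustrative split ratios
            train_idx, test_idx = stratified_split(labels, test_fraction)
            # Majority-class baseline: a placeholder for a trained model.
            majority = Counter(labels[i] for i in train_idx).most_common(1)[0][0]
            y_true = [labels[i] for i in test_idx]
            y_pred = [majority] * len(y_true)
            scores = multiclass_metrics(y_true, y_pred)
            print(n_samples, test_fraction, {k: round(v, 3) for k, v in scores.items()})
```

On imbalanced classes, the baseline's accuracy stays well above its balanced accuracy, which is one simple way to see why models can rank differently depending on the performance measure chosen.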

Keywords:  XGBoost; imbalanced; machine learning; multiclass classification; training/test split ratio; validation

Year:  2021        PMID: 33669834      PMCID: PMC7922354          DOI: 10.3390/molecules26041111

Source DB:  PubMed          Journal:  Molecules        ISSN: 1420-3049            Impact factor:   4.411


  17 in total

1.  Class-imbalanced classifiers for high-dimensional data. (Review)

Authors:  Wei-Jiun Lin; James J Chen
Journal:  Brief Bioinform       Date:  2012-03-09       Impact factor: 11.622

2.  Extreme learning machine for regression and multiclass classification.

Authors:  Guang-Bin Huang; Hongming Zhou; Xiaojian Ding; Rui Zhang
Journal:  IEEE Trans Syst Man Cybern B Cybern       Date:  2011-10-06

3.  Classification ensembles for unbalanced class sizes in predictive toxicology.

Authors:  J J Chen; C A Tsai; J F Young; R L Kodell
Journal:  SAR QSAR Environ Res       Date:  2005-12       Impact factor: 3.000

4.  Support vector machines for classification and regression.

Authors:  Richard G Brereton; Gavin R Lloyd
Journal:  Analyst       Date:  2009-12-23       Impact factor: 4.616

5.  Points of significance: Bayes' theorem.

Authors:  Jorge López Puga; Martin Krzywinski; Naomi Altman
Journal:  Nat Methods       Date:  2015-04       Impact factor: 28.547

6.  A systematic study of the class imbalance problem in convolutional neural networks.

Authors:  Mateusz Buda; Atsuto Maki; Maciej A Mazurowski
Journal:  Neural Netw       Date:  2018-07-29

7.  A novel approach to generate robust classification models to predict developmental toxicity from imbalanced datasets.

Authors:  S B Gunturi; N Ramamurthi
Journal:  SAR QSAR Environ Res       Date:  2014-08-07       Impact factor: 3.000

8.  Binary and multi-class classification for androgen receptor agonists, antagonists and binders.

Authors:  Geven Piir; Sulev Sild; Uko Maran
Journal:  Chemosphere       Date:  2020-09-11       Impact factor: 7.086

9.  Predicting Fraction Unbound in Human Plasma from Chemical Structure: Improved Accuracy in the Low Value Ranges.

Authors:  Reiko Watanabe; Tsuyoshi Esaki; Hitoshi Kawashima; Yayoi Natsume-Kitatani; Chioko Nagao; Rikiya Ohashi; Kenji Mizuguchi
Journal:  Mol Pharm       Date:  2018-09-27       Impact factor: 4.939

10.  Multi-Level Comparison of Machine Learning Classifiers and Their Performance Metrics.

Authors:  Anita Rácz; Dávid Bajusz; Károly Héberger
Journal:  Molecules       Date:  2019-08-01       Impact factor: 4.411

  7 in total

1.  Machining feature recognition based on deep neural networks to support tight integration with 3D CAD systems.

Authors:  Changmo Yeo; Byung Chul Kim; Sanguk Cheon; Jinwon Lee; Duhwan Mun
Journal:  Sci Rep       Date:  2021-11-12       Impact factor: 4.379

2.  Predicting Divorce Prospect Using Ensemble Learning: Support Vector Machine, Linear Model, and Neural Network.

Authors:  Mian Muhammad Sadiq Fareed; Ali Raza; Na Zhao; Aqil Tariq; Faizan Younas; Gulnaz Ahmed; Saleem Ullah; Syeda Fizzah Jillani; Irfan Abbas; Muhammad Aslam
Journal:  Comput Intell Neurosci       Date:  2022-07-11

3.  Computer-Aided (In Silico) Modeling of Cytochrome P450-Mediated Food-Drug Interactions (FDI). (Review)

Authors:  Yelena Guttman; Zohar Kerem
Journal:  Int J Mol Sci       Date:  2022-07-31       Impact factor: 6.208

4.  A Conditional GAN for Generating Time Series Data for Stress Detection in Wearable Physiological Sensor Data.

Authors:  Maximilian Ehrhart; Bernd Resch; Clemens Havas; David Niederseer
Journal:  Sensors (Basel)       Date:  2022-08-10       Impact factor: 3.847

5.  COVID-19 cough classification using machine learning and global smartphone recordings.

Authors:  Madhurananda Pahar; Marisa Klopper; Robin Warren; Thomas Niesler
Journal:  Comput Biol Med       Date:  2021-06-17       Impact factor: 4.589

6.  Predictive Capability of QSAR Models Based on the CompTox Zebrafish Embryo Assays: An Imbalanced Classification Problem.

Authors:  Mario Lovrić; Olga Malev; Göran Klobučar; Roman Kern; Jay J Liu; Bono Lučić
Journal:  Molecules       Date:  2021-03-15       Impact factor: 4.411

7.  Brain Decoding Using fMRI Images for Multiple Subjects through Deep Learning.

Authors:  Muhammad Bilal Qureshi; Laraib Azad; Muhammad Shuaib Qureshi; Sheraz Aslam; Ayman Aljarbouh; Muhammad Fayaz
Journal:  Comput Math Methods Med       Date:  2022-03-01       Impact factor: 2.238
