| Literature DB >> 33958703 |
Claas Flint, Micah Cearns, Udo Dannlowski, Tim Hahn, Nils Opel, Ronny Redlich, David M A Mehler, Daniel Emden, Nils R Winter, Ramona Leenings, Simon B Eickhoff, Tilo Kircher, Axel Krug, Igor Nenadic, Volker Arolt, Scott Clark, Bernhard T Baune, Xiaoyi Jiang.
Abstract
We currently observe a disconcerting phenomenon in machine learning studies in psychiatry: While we would expect larger samples to yield better results due to the availability of more data, larger machine learning studies consistently show much weaker performance than the numerous small-scale studies. Here, we systematically investigated this effect focusing on one of the most heavily studied questions in the field, namely the classification of patients suffering from Major Depressive Disorder (MDD) and healthy controls based on neuroimaging data. Drawing upon structural MRI data from a balanced sample of N = 1868 MDD patients and healthy controls from our recent international Predictive Analytics Competition (PAC), we first trained and tested a classification model on the full dataset which yielded an accuracy of 61%. Next, we mimicked the process by which researchers would draw samples of various sizes (N = 4 to N = 150) from the population and showed a strong risk of misestimation. Specifically, for small sample sizes (N = 20), we observe accuracies of up to 95%. For medium sample sizes (N = 100) accuracies up to 75% were found. Importantly, further investigation showed that sufficiently large test sets effectively protect against performance misestimation whereas larger datasets per se do not. While these results question the validity of a substantial part of the current literature, we outline the relatively low-cost remedy of larger test sets, which is readily available in most cases.
Entities:
Mesh:
Year: 2021 PMID: 33958703 PMCID: PMC8209109 DOI: 10.1038/s41386-021-01020-7
Source DB: PubMed Journal: Neuropsychopharmacology ISSN: 0893-133X Impact factor: 7.853
Fig. 1 Workflow to investigate the relationship between sample size and misestimation.
First, the effect of misestimation is investigated over the whole classification process ((1) Overall sample size analysis). Then, training and testing are evaluated separately ((2) Training set sample size analysis, (3) Test set sample size analysis).
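A minimal sketch of the overall sample size analysis, using synthetic data as a stand-in for the PAC MRI sample (the feature count, effect size, seed, and repetition count below are illustrative assumptions, not the authors' code):

# Sketch: how LOOCV accuracy estimates scatter when small samples are
# drawn from a weakly separable population. The effect size is kept
# small to echo the ~61% full-sample accuracy, but is an assumption.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)

def draw_sample(n, n_features=50, effect=0.15):
    """Balanced two-class sample with a small mean shift per feature."""
    y = np.repeat([0, 1], n // 2)
    X = rng.normal(size=(n, n_features))
    X[y == 1] += effect
    return X, y

for n in (20, 100):
    accs = []
    for _ in range(100):  # 100 simulated small-scale "studies"
        X, y = draw_sample(n)
        accs.append(cross_val_score(SVC(kernel="linear"), X, y,
                                    cv=LeaveOneOut()).mean())
    accs = np.array(accs)
    print(f"N={n:3d}: mean={accs.mean():.2f} "
          f"min={accs.min():.2f} max={accs.max():.2f}")

Under this setup, the smaller N is, the wider the spread of LOOCV accuracies around the population-level separability, which is the misestimation effect the workflow quantifies.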
Fig. 2 Effects of varying overall sample sizes employing LOOCV.
a Probabilities for linear SVMs to yield an accuracy exceeding a certain threshold as a function of sample size, employing LOOCV. b Minimum, maximum, and mean results for the linear SVMs as a function of sample size, employing LOOCV. c Probabilities for dummy classifiers to yield an accuracy above a certain chance level as a function of sample size. d Minimum, maximum, and mean results for the dummy classifiers as a function of the sample size used for training and testing.
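The dummy-classifier curves in panels c and d have a closed-form counterpart: for an uninformative classifier on a balanced sample, the number of correct predictions is binomial, so the probability of exceeding a given accuracy purely by chance is a binomial tail. A sketch (thresholds and sizes chosen for illustration):

# Sketch: chance probability that an uninformative classifier reaches a
# given accuracy on a balanced sample of size N (binomial tail, p = 0.5).
# This mirrors what Fig. 2c estimates empirically.
from math import ceil
from scipy.stats import binom

for n in (20, 100, 300):
    for threshold in (0.6, 0.7, 0.8):
        k = ceil(threshold * n)          # correct predictions needed
        p = binom.sf(k - 1, n, 0.5)      # P(X >= k) under coin flips
        print(f"N={n:3d}, acc>={threshold:.1f}: P={p:.3f}")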
Fig. 3 Results as a function of training set size with a fixed test set size of N = 300.
a Probabilities for linear SVMs to yield an accuracy exceeding a certain threshold as a function of training set size. b Minimum, maximum, and mean results for the linear SVMs as a function of training set size. c Probabilities for the dummy classifier to yield an accuracy above a certain chance level as a function of training set size. d Minimum, maximum, and mean results for the dummy classifier as a function of the sample size used for training.
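A sketch of this training set analysis under the same illustrative synthetic setup (a fixed held-out test set of N = 300 while the training set grows; all sizes and repetition counts are assumptions):

# Sketch of the Fig. 3 setup: one fixed test set, growing training sets.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)

def draw_sample(n, n_features=50, effect=0.15):
    y = np.repeat([0, 1], n // 2)
    X = rng.normal(size=(n, n_features))
    X[y == 1] += effect
    return X, y

X_test, y_test = draw_sample(300)  # fixed held-out test set

for n_train in (4, 20, 50, 100, 150):
    accs = []
    for _ in range(100):
        X_tr, y_tr = draw_sample(n_train)
        accs.append(SVC(kernel="linear").fit(X_tr, y_tr).score(X_test, y_test))
    accs = np.array(accs)
    print(f"train N={n_train:3d}: mean={accs.mean():.2f} "
          f"min={accs.min():.2f} max={accs.max():.2f}")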
Fig. 4 Results as a function of test set size with a fixed classifier.
a Probabilities for linear SVMs to yield an accuracy exceeding a certain threshold as a function of test set size. b Minimum, maximum, and mean results for the linear SVMs as a function of test set size. c Probabilities for the dummy classifier to yield an accuracy above a certain chance level as a function of test set size. d Minimum, maximum, and mean results for the dummy classifier as a function of the sample size used for testing.
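The complementary test set analysis can be sketched the same way: fit one classifier once, then score it on freshly drawn test sets of varying size. The spread of the accuracy estimate shrinks roughly with 1/sqrt(N_test), which is why large test sets protect against misestimation. Again synthetic and illustrative:

# Sketch of the Fig. 4 setup: a single fixed classifier evaluated on
# test sets of varying size drawn from the same synthetic population.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)

def draw_sample(n, n_features=50, effect=0.15):
    y = np.repeat([0, 1], n // 2)
    X = rng.normal(size=(n, n_features))
    X[y == 1] += effect
    return X, y

clf = SVC(kernel="linear").fit(*draw_sample(150))  # fixed classifier

for n_test in (20, 100, 300, 1000):
    accs = np.array([clf.score(*draw_sample(n_test)) for _ in range(200)])
    print(f"test N={n_test:4d}: mean={accs.mean():.2f} "
          f"min={accs.min():.2f} max={accs.max():.2f} sd={accs.std():.3f}")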