Andrew Cwiek, Sarah M. Rajtmajer, Bradley Wyble, Vasant Honavar, Emily Grossner, Frank G. Hillary.
Abstract
In this critical review, we examine the application of predictive models (e.g., classifiers) trained using machine learning (ML) to assist in the interpretation of functional neuroimaging data. Our primary goal is to summarize how ML is being applied and to critically assess common practices. Our review covers 250 studies published using ML and resting-state functional MRI (fMRI) to infer various dimensions of the human functional connectome. Holdout ("lockbox") performance was, on average, ∼13% less accurate than performance measured through cross-validation alone, highlighting the importance of lockbox data, which were included in only 16% of the studies. There was also a concerning lack of transparency across the key steps in training and evaluating predictive models. This summary of the literature underscores the importance of using a lockbox and highlights several methodological pitfalls that can be addressed by the imaging community. We argue that, ideally, studies are motivated both by the reproducibility and generalizability of findings and by the potential clinical significance of the insights. We offer recommendations for the principled integration of machine learning into the clinical neurosciences with the goal of advancing imaging biomarkers of brain disorders, understanding causative determinants of health risks, and parsing heterogeneous patient outcomes.
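The lockbox finding above reflects a simple protocol that can be sketched in a few lines. The example below is illustrative only (hypothetical data and scikit-learn estimators, not the pipeline of any reviewed study): the holdout set is split off before any modeling, cross-validation is run solely on the training portion, and the lockbox is scored exactly once.

```python
# Minimal sketch of the holdout ("lockbox") protocol, on synthetic data.
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 4005))   # e.g., vectorized connectivity features
y = rng.integers(0, 2, size=200)       # e.g., patient vs. control labels

# Set the lockbox aside before any model selection or feature engineering.
X_train, X_lockbox, y_train, y_lockbox = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

clf = SVC(kernel="linear", C=1.0)

# Cross-validation estimate, computed only on the training split.
cv_acc = cross_val_score(clf, X_train, y_train, cv=5, scoring="accuracy").mean()

# Single, final evaluation on the lockbox, reported alongside the CV estimate.
clf.fit(X_train, y_train)
lockbox_acc = clf.score(X_lockbox, y_lockbox)
print(f"cross-validation accuracy: {cv_acc:.3f}, lockbox accuracy: {lockbox_acc:.3f}")
```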
Keywords: Brain networks; Classifiers; Clinical neuroscience; Machine learning; Predictive modeling
Year: 2022 PMID: 35350584 PMCID: PMC8942606 DOI: 10.1162/netn_a_00212
Source DB: PubMed Journal: Netw Neurosci ISSN: 2472-1751
PRISMA flowchart of the literature review. *An initial PubMed search was conducted; following valuable feedback, an updated search was conducted that included articles up to the year 2021 and added terms to broaden the search to deep learning algorithms. For details, please see the section Method: Literature Review. **The initial review did not delineate removals at each particular step; the updated review includes a step-by-step workflow. ***220 from the updated search + 30 nonduplicates from the initial search. Flowchart modified from Page et al. (2021).
Sample sizes for population and subgroups in training and test datasets
| | | | | |
|---|---|---|---|---|
| | 17–1305 | 8–653 | 8–477 | 1–185 |
| | 126.7 | 50.0 | 96.6 | 38.1 |
| | 77 | 29 | 39 | 20 |
| | 80 (32.0%) | 192 (76.8%) | 23 (52.3%) | 35 (79.6%) |
| | 24 (9.6%) | 136 (54.4%) | 14 (31.8%) | 28 (63.6%) |
| | 3 (1.2%) | 82 (32.8%) | 8 (18.2%) | 22 (50.0%) |
Network data: Characteristics of functional brain imaging network analysis included in prediction modeling
| | | | | | |
|---|---|---|---|---|---|
| | <10 to 67,955 | 90 | 483.9 (6,654.5) | 90 | |
| | 67.9% | 3.2% | 6.1% | 3.6% | 18.3% |
| | 73.1% | 19.0% | 7.9% | 3% | |
Note: All studies included defined nodes, but in some cases the exact number of nodes was unclear with respect to ML training (n = 30). Similarly, all studies examined connectivity between brain regions, but for a small number of studies there was no clear edge definition (n = 3).
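For readers unfamiliar with how node and edge definitions translate into model inputs, the following hedged sketch (hypothetical time series and an assumed 90-region parcellation, not any particular reviewed study) shows one common construction: edges defined as Pearson correlations between regional time series, with the upper triangle of the connectivity matrix vectorized into features.

```python
# Illustrative construction of connectome features from regional time series.
import numpy as np

n_nodes, n_timepoints = 90, 200               # assumed atlas size and scan length
ts = np.random.randn(n_timepoints, n_nodes)   # stand-in for regional BOLD time series

conn = np.corrcoef(ts, rowvar=False)          # 90 x 90 Pearson correlation (edge) matrix
iu = np.triu_indices(n_nodes, k=1)            # indices of unique edges (exclude diagonal)
features = conn[iu]                           # 4,005 edge weights -> one row of the design matrix
print(features.shape)                         # (4005,)
```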
Classifier types, inputs, and metrics for evaluation during classification
| | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| | 171 (68.4%) | 20 (8.0%) | 17 (6.8%) | 22 (8.8%) | 22 (8.8%) | 20 (8.0%) | 46 (18.0%) | 52 (20.8%) |
| | 100% | 13.5% | 10.1% | 5.9% | 2.5% | 1.7% | 0% | 1.6% |
| | 87% | 70.4% | 69% | 40% | 12% | 12% | 20% | |
Note: SVM, support vector machine; RF, random forest; KNN, k nearest-neighbor; LOG_R, logistic regression; LDA, linear discriminant analysis. *Total >100%, including studies with more than one classification approach.
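As a purely illustrative companion to the table (synthetic data; scikit-learn scorer names), the sketch below evaluates the most frequently used classifier type, a linear SVM, with several of the evaluation metrics tallied above.

```python
# Hedged sketch: cross-validated evaluation of a linear SVM with multiple metrics.
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.standard_normal((120, 4005))    # vectorized connectomes (hypothetical)
y = rng.integers(0, 2, size=120)        # binary diagnostic labels (hypothetical)

scores = cross_validate(
    SVC(kernel="linear"),
    X, y, cv=5,
    scoring={"accuracy": "accuracy", "sensitivity": "recall", "auc": "roc_auc"},
)
for name in ("accuracy", "sensitivity", "auc"):
    print(name, scores[f"test_{name}"].mean())
```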
Validation measures
| | | | |
|---|---|---|---|
| | 94.1% | 4.2% | 1.7% |
| | 20.3% | 79.7% | 0.0% |
| | 70.8% | 12.5% | 16.7% |
Common techniques for enhancing model interpretation
| | | |
|---|---|---|
| | 47.2% | 52.8% |
| | 34.0% | 66.0% |
| | 27.7% | 72.3% |
| | 20.0% | 80.0% |
Note: >100% due to multiple approaches used in some studies.
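The sketch below illustrates, on synthetic data, two generic interpretation aids of the kind counted above: inspecting linear SVM weights and permutation importance. It is an assumed example, not a reproduction of any reviewed study's approach.

```python
# Illustrative model-interpretation aids for a linear classifier.
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 50))      # e.g., 50 selected edge features (hypothetical)
y = rng.integers(0, 2, size=100)

clf = SVC(kernel="linear").fit(X, y)

# Linear-kernel weights: larger |w| suggests an edge contributes more to the decision.
edge_weights = clf.coef_.ravel()
print("highest-weight edges:", np.argsort(np.abs(edge_weights))[::-1][:5])

# Permutation importance: drop in accuracy when each feature is shuffled.
imp = permutation_importance(clf, X, y, n_repeats=10, random_state=0)
print("most important edges:", np.argsort(imp.importances_mean)[::-1][:5])
```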
A histogram of accuracy scores for n = 250 studies reviewed reveals distinct distributions and median scores (organized in text boxes by color) for classification accuracy based on results using no validation, cross-validation, and external validation (i.e., lockbox).
Illustration of distinct decision points in the typical ML pipeline in the papers included in this review. We identify eight decision points where there are opportunities to report (R) information to maximize transparency. R1a: Justify the choice of classifier model from previous literature, limitations of the data, and the clinical goals of the study. R1b: Explain how data were split between training and test sets (i.e., lockbox), including sample sizes and any matching of demographic or disease variables. R2: Make clear decisions about how the network was created, including edge definition and brain parcellation. R3: Make explicit the specifics of the model (e.g., parameter settings, kernel functions), and make clear which features (e.g., network metrics, clinical variables) are included in the model. R4: Report cross-validation method selection and implementation; justify its use in the context of sample size and the potential risk of performance overestimation. R5: Explain the conditions necessary to terminate algorithm training, such as target performance or a minimum feature count. R6: Make explicit the hyperparameter settings and any manual tuning of parameters between training iterations. R7a: Report training set results, including model performance, feature weights, and feature counts across training iterations. R7b: Explicitly state that preprocessing is unchanged from the final algorithm derived from training and that there was no access to the lockbox during training; provide the final averaged cross-validation performance and feature importance for the test set. R8: Provide clear interpretation and explainability for the model by highlighting key findings in the context of potential clinical utility (i.e., connectivity patterns of relevant regions of interest).
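A minimal sketch of a pipeline consistent with several of these reporting points, assuming synthetic data, a hypothetical parameter grid, and scikit-learn components (not the authors' implementation): the lockbox is split off first (R1b), feature selection and hyperparameter tuning happen only inside cross-validation on the training set (R4, R6), and the lockbox is scored once with the frozen pipeline (R7b).

```python
# Hedged sketch of a train/lockbox workflow with tuning confined to the training set.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.standard_normal((180, 4005))    # hypothetical vectorized connectomes
y = rng.integers(0, 2, size=180)        # hypothetical labels

X_train, X_lockbox, y_train, y_lockbox = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)           # R1b: document the split

pipe = Pipeline([
    ("select", SelectKBest(f_classif)),                        # R3: features entering the model
    ("svm", SVC(kernel="linear")),
])
grid = {"select__k": [50, 200, 1000], "svm__C": [0.1, 1, 10]}  # R6: report the search grid

search = GridSearchCV(pipe, grid, cv=5, scoring="accuracy")    # R4: CV on training data only
search.fit(X_train, y_train)

print("training CV accuracy:", search.best_score_, search.best_params_)  # R7a
print("lockbox accuracy:", search.score(X_lockbox, y_lockbox))           # R7b: scored once
```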