Dustin Scheinost, Stephanie Noble, Corey Horien, Abigail S Greene, Evelyn MR Lake, Mehraveh Salehi, Siyuan Gao, Xilin Shen, David O'Connor, Daniel S Barron, Sarah W Yip, Monica D Rosenberg, R Todd Constable.
Abstract
Establishing brain-behavior associations that map brain organization to phenotypic measures and generalize to novel individuals remains a challenge in neuroimaging. Predictive modeling approaches that define and validate models with independent datasets offer a solution to this problem. While these methods can detect novel and generalizable brain-behavior associations, they can be daunting, which has limited their use by the wider connectivity community. Here, we offer practical advice and examples based on functional magnetic resonance imaging (fMRI) functional connectivity data for implementing these approaches. We hope these ten rules will increase the use of predictive models with neuroimaging data.
Keywords: Classification; Connectome; Cross-validation; Machine learning; Neural networks
Year: 2019 PMID: 30831310 PMCID: PMC6521850 DOI: 10.1016/j.neuroimage.2019.02.057
Source DB: PubMed Journal: Neuroimage ISSN: 1053-8119 Impact factor: 6.556
Definitions of common terms.
| Term | Definition |
|---|---|
| confound | Variable that affects the study variables and systematically differs across individuals |
| cross-validation | Methods of internal model validation in which a single dataset is divided into testing and training data several times |
| explanatory modeling | The generation and use of statistical models to test hypotheses about associations between observed variables. |
| exploratory research | Research conducted at an early stage of inquiry to generate new hypotheses and establish a framework for future analyses |
| external validation | Testing predictive model performance on an independently collected dataset |
| false negative | Cases incorrectly classified by the model (e.g., patients incorrectly identified as non-patients) |
| false positive | Non-cases incorrectly classified by the model (e.g., non-patients incorrectly identified as patients) |
| generalizability | How well results from a sample population reflect the population at large |
| hyperparameters | Free parameters of an algorithm that must be set by the researcher rather than estimated during model fitting |
| interpretability | The ability of a researcher to understand a model and use it to better understand brain-behavior associations |
| model | An equation that maps a set of independent variables (e.g., connectivity features) to a dependent variable (e.g., a phenotypic measure) |
| multiple comparisons | Evaluating many hypotheses via statistical inferences simultaneously which may lead to observing a significant result simply by chance |
| nested cross-validation | A validation approach where, in each fold of a cross-validation, a second cross-validation is used to estimate a free parameter. |
| nuisance variable | Variables that are associated with the study variables and increase data variability, but have no neurobiological relevance to the study question. |
| overfitting | The tendency for statistical models to mistakenly fit sample-specific noise as if it were signal, leading to inflated effect size estimates |
| p-hacking | Selectively choosing data or analysis pipelines to obtain significant results |
| phenotype | An observable characteristic of an individual (e.g., behavior) |
| predictive modeling | The generation and use of statistical models to estimate new or future information. The term is synonymous with machine learning. |
| sensitivity | The proportion of true positives (see below for definition) relative to the number of cases (e.g., the number of patients correctly classified as patients divided by the number of patients); also referred to as recall (see the sketch following this table) |
| specificity | The proportion of true negatives (see below for definition) relative to the number of non-cases (e.g. the number of non-patients correctly classified as non-patients divided by the number of non-patients). |
| testing data | Data used for model application and evaluation. The model built on the training data is applied to testing data for prediction. |
| training data | Data used to generate a statistical model that will be applied to a testing data set |
| true negative | Non-cases correctly classified by the model (e.g., non-patients correctly classified as non-patients) |
| true positive | Cases correctly classified by the model (e.g., patients correctly classified as patients) |
| underfitting | The failure of a model to capture the relationship between the response and predictor, generally due to inadequate model complexity, resulting in poor model performance in both training and testing data |
| validation | An unbiased evaluation of model performance on data independent from the data used to generate the model |
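As a concrete illustration of the classification terms defined above (true/false positives and negatives, sensitivity, specificity), the following minimal sketch, assuming scikit-learn and a hypothetical patient vs. non-patient classifier, shows how they are computed from a confusion matrix.

```python
# Minimal sketch (not from the paper): classification terms from the table above,
# computed for a hypothetical patient vs. non-patient classifier.
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = patient (case), 0 = non-patient (non-case)
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0, 1, 0, 0, 0])

# With labels=[0, 1], confusion_matrix returns [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

sensitivity = tp / (tp + fn)   # recall: correctly classified patients / all patients
specificity = tn / (tn + fp)   # correctly classified non-patients / all non-patients
print(f"sensitivity = {sensitivity:.2f}, specificity = {specificity:.2f}")
```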
List of rules and key references.
| Section | Number | Rule | Key references |
|---|---|---|---|
| Validating predictive models with independent data: why and how? | #1 | Use out-of-sample prediction to generate more accurate and generalizable models | |
| | #2 | Keep training and testing data independent | |
| | #3 | Use internal validation (i.e., cross-validation) as a practical solution for validating predictive models | |
| | #4 | Share data, code, and models to facilitate external validation and open science | |
| Measuring model performance | #5 | Consider the question of interest when choosing a performance metric and properly assess statistical significance | |
| | #6 | Be mindful of sample characteristics | |
| | #7 | Apply nested cross-validation or multiple comparisons correction when testing multiple models and parameters (see the sketch below) | |
| Accounting for confounds and interpreting results | #8 | Check to see if you are predicting what you think you are predicting | |
| | #9 | Do not expect one model to fit all traits, states, or populations | |
| | #10 | Remember: interpretability matters | |
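Rules #3 and #7 refer to (nested) cross-validation. The sketch below illustrates nested cross-validation under simplifying assumptions: scikit-learn on synthetic data, with a generic ridge regression standing in for the paper's connectome-based predictive modeling (CPM) pipeline. The inner loop selects the hyperparameter; the outer loop estimates out-of-sample performance.

```python
# Sketch of nested cross-validation (Rules #3 and #7); ridge regression is an
# assumed stand-in for CPM, and the data are synthetic.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 500))                    # e.g., vectorized connectivity edges
y = X[:, :10].sum(axis=1) + rng.standard_normal(200)   # synthetic phenotypic measure

inner_cv = KFold(n_splits=5, shuffle=True, random_state=1)   # tunes the hyperparameter
outer_cv = KFold(n_splits=10, shuffle=True, random_state=2)  # estimates performance

# Inner loop: choose the regularization strength using the training folds only
model = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0, 100.0]}, cv=inner_cv)

# Outer loop: evaluate the tuned model on held-out folds it never saw during tuning
scores = cross_val_score(model, X, y, cv=outer_cv)
print(f"nested cross-validation R^2: {scores.mean():.2f} +/- {scores.std():.2f}")
```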
Fig. 1. General workflow for a predictive modeling study using neuroimaging data. Each box illustrates a different step in a typical study, along with relevant considerations. Pertinent rules discussed in the text are highlighted in each box as appropriate.
Fig. 2. Comparison of standardized MSE for different cross-validation methods with either A) variable training data size or B) constant training data size. A) Using 200 iterations of random sampling of 500 individuals from the Human Connectome Project (HCP) dataset, connectome-based predictive modeling (CPM) was applied to predict a measure of fluid intelligence (PMAT) with 4 different cross-validation strategies: split-half, 5-fold, 10-fold, and leave-one-out (LOO) cross-validation. For each strategy, the size of the training data was variable (i.e., the total sample was held constant), with split-half cross-validation using the fewest individuals for training (N = 250) and leave-one-out using the most individuals for training (N = 499). All cross-validation strategies gave similar prediction performance, with leave-one-out cross-validation performing best due to the greater amount of training data. B) In contrast, when using 200 iterations of random sampling of individuals from the HCP dataset while keeping the number of individuals in the training data constant (N = 180) (i.e., the total sample for each strategy was variable), leave-one-out cross-validation exhibited the largest variance in performance and split-half cross-validation exhibited the smallest variance in performance. These data demonstrate the bias-variance tradeoff of different cross-validation strategies. See Supplemental Methods for further methodological details.
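The following sketch mirrors the Fig. 2 comparison under simplifying assumptions (scikit-learn on synthetic data, with ridge regression standing in for CPM), scoring split-half, 5-fold, 10-fold, and leave-one-out cross-validation with standardized MSE.

```python
# Sketch comparing the cross-validation strategies from Fig. 2 on synthetic data;
# ridge regression is an assumed stand-in for CPM.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 300))                    # synthetic connectivity features
y = X[:, :5].sum(axis=1) + rng.standard_normal(500)    # synthetic behavioral measure

strategies = {
    "split-half": KFold(n_splits=2, shuffle=True, random_state=1),
    "5-fold": KFold(n_splits=5, shuffle=True, random_state=1),
    "10-fold": KFold(n_splits=10, shuffle=True, random_state=1),
    "leave-one-out": LeaveOneOut(),
}

for name, cv in strategies.items():
    mse = -cross_val_score(Ridge(alpha=10.0), X, y, cv=cv,
                           scoring="neg_mean_squared_error").mean()
    smse = mse / y.var()   # standardized MSE: MSE normalized by the variance of y
    print(f"{name:>13}: standardized MSE = {smse:.2f}")
```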
Fig. 3. Comparison of prediction R2 calculated directly from comparing observed and predicted values and explanatory R2 calculated from linear regression. Using 200 iterations of 400 individuals for training and 400 individuals for testing, randomly selected from the HCP dataset, CPM was used to predict PMAT using split-half, 5-fold, 10-fold, and leave-one-out (LOO) cross-validation. Each point represents the same CPM model evaluated with prediction R2 (on the y-axis) and explanatory R2 (on the x-axis). Prediction R2 was calculated as 1 minus the normalized mean squared error between the observed and predicted values (see Rule #5), while explanatory R2 was calculated as the square of the Pearson correlation between the observed and predicted values. For all cross-validation strategies, explanatory R2 from linear regression over-estimates performance compared to prediction R2 calculated directly from the observed and predicted values. This bias is greatest at lower prediction performance and decreases for better-predicting models. The line in each plot represents the y = x line. See Supplemental Methods for further methodological details.
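The two metrics contrasted in Fig. 3 can be computed directly from observed and predicted values; the sketch below (assuming only NumPy and SciPy, on toy data) shows prediction R2 as 1 minus the normalized MSE and explanatory R2 as the squared Pearson correlation.

```python
# Sketch of the two R^2 metrics contrasted in Fig. 3, on toy data.
import numpy as np
from scipy.stats import pearsonr

def prediction_r2(y_obs, y_pred):
    """1 minus normalized MSE; can be negative for poorly predicting models."""
    return 1.0 - np.mean((y_obs - y_pred) ** 2) / np.var(y_obs)

def explanatory_r2(y_obs, y_pred):
    """Squared Pearson correlation; insensitive to bias and scaling of predictions."""
    r, _ = pearsonr(y_obs, y_pred)
    return r ** 2

# Toy example: predictions track the observed values but are shrunken and offset
rng = np.random.default_rng(0)
y_obs = rng.standard_normal(400)
y_pred = 0.3 * y_obs + 0.5 + 0.3 * rng.standard_normal(400)

print(f"prediction  R^2 = {prediction_r2(y_obs, y_pred):.2f}")   # penalizes shrinkage and offset
print(f"explanatory R^2 = {explanatory_r2(y_obs, y_pred):.2f}")  # measures only linear association
```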
Fig. 4. Comparison of prediction performance as a function of the number of individuals in the training data. Using 200 iterations of 400 individuals for training and 400 individuals for testing, randomly selected from the HCP dataset, CPM was used to predict PMAT with a variable number of individuals in the training data, starting with 25 individuals and increasing to 400 individuals in steps of 25. For each iteration, every CPM model was evaluated on the same 400 test individuals. Increasing the number of individuals in the training data increased the performance of the CPM model, with performance beginning to plateau beyond 200 individuals in the training data. The panel on the left shows model performance evaluated with standardized MSE. The panel on the right shows model performance evaluated with Pearson’s correlation. See Supplemental Methods for further methodological details.
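A sketch of the Fig. 4 style learning-curve analysis is shown below, again assuming scikit-learn, synthetic data, and ridge regression in place of CPM: models are trained on an increasing number of individuals and evaluated on the same held-out test set using standardized MSE and Pearson's correlation.

```python
# Sketch of a Fig. 4 style learning curve on synthetic data; ridge regression is an
# assumed stand-in for CPM.
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.standard_normal((800, 300))                    # 400 training + 400 testing individuals
y = X[:, :5].sum(axis=1) + rng.standard_normal(800)
X_train, y_train, X_test, y_test = X[:400], y[:400], X[400:], y[400:]

for n in range(25, 401, 25):                           # 25, 50, ..., 400 training individuals
    model = Ridge(alpha=10.0).fit(X_train[:n], y_train[:n])
    y_pred = model.predict(X_test)
    smse = np.mean((y_test - y_pred) ** 2) / np.var(y_test)   # standardized MSE
    r, _ = pearsonr(y_test, y_pred)                           # Pearson's correlation
    print(f"n_train = {n:3d}: standardized MSE = {smse:.2f}, r = {r:.2f}")
```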