Elias Chaibub Neto, Abhishek Pratap, Thanneer M Perumal, Meghasyam Tummalacherla, Phil Snyder, Brian M Bot, Andrew D Trister, Stephen H Friend, Lara Mangravite, Larsson Omberg.
Abstract
Collection of high-dimensional, longitudinal digital health data has the potential to support a wide variety of research and clinical applications including diagnostics and longitudinal health tracking. Algorithms that process these data and inform digital diagnostics are typically developed using training and test sets generated from multiple repeated measures collected across a set of individuals. However, the inclusion of repeated measurements is not always appropriately taken into account in the analytical evaluations of predictive performance. The assignment of repeated measurements from each individual to both the training and the test sets ("record-wise" data split) is a common practice and can lead to massive underestimation of the prediction error due to the presence of "identity confounding." In essence, these models learn to identify subjects, in addition to diagnostic signal. Here, we present a method that can be used to effectively calculate the amount of identity confounding learned by classifiers developed using a record-wise data split. By applying this method to several real datasets, we demonstrate that identity confounding is a serious issue in digital health studies and that record-wise data splits for machine learning-based applications need to be avoided.
Keywords: Preclinical research; Statistics
Year: 2019 PMID: 31633058 PMCID: PMC6789029 DOI: 10.1038/s41746-019-0178-x
Source DB: PubMed Journal: NPJ Digit Med ISSN: 2398-6352
Fig. 1: Permutation scheme to detect identity confounding. The schematic shows a toy example for a dataset with eight subjects (four cases and four controls), where each subject contributed two records. a and b show, respectively, the disease label vector and the feature matrix. c shows distinct “subject-wise random permutations” of the disease labels, where the permutations are performed at the subject level, rather than at the record level (so that all records of a given subject are assigned either “case” or “control” labels). For example, in the first permutation, the labels of subjects 2 and 3 changed from “case” to “control”, the labels of subjects 5 and 8 changed from “control” to “case”, and the labels of subjects 1, 4, 6, and 7 remained the same. (Note that for each subject, the labels are changed across all records.) The subject-wise label permutations destroy the association between the disease labels and the features, making it impossible for a classifier trained with shuffled labels to learn the disease signal. Under the record-wise data split strategy, with half of the records assigned to the training set (d) and the other half to the test set (e), both the training and test sets contain one record from each subject. Most importantly, in each permutation the shuffled labels of each subject are the same in both the training and test sets. For instance, in the first permutation (highlighted by the red boxes in d, e) the shuffled labels of subjects 1 to 8, namely, “case”, “control”, “control”, “case”, “case”, “control”, “control”, “case”, are exactly the same in the training and test sets. Consequently, any classifier trained with the shuffled labels will still be able to learn to identify individuals, even though it cannot learn the disease signal.
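The subject-wise permutation described in the caption can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `subject_wise_permutation`, the toy data layout, and the choice of sending each subject's first record to the training set are all assumptions made for the example.

```python
import random

def subject_wise_permutation(subject_ids, labels, rng):
    """Shuffle labels across subjects so that all records of a subject
    share the same shuffled label (hypothetical helper for illustration)."""
    # Collect one label per subject; labels must be constant within a subject.
    subject_label = {}
    for sid, lab in zip(subject_ids, labels):
        if subject_label.setdefault(sid, lab) != lab:
            raise ValueError(f"subject {sid} has inconsistent labels")
    subjects = list(subject_label)
    shuffled = [subject_label[s] for s in subjects]
    rng.shuffle(shuffled)  # permute at the subject level, not the record level
    new_label = dict(zip(subjects, shuffled))
    return [new_label[sid] for sid in subject_ids]

# Toy data mirroring the schematic: 8 subjects, 2 records each,
# subjects 1-4 are cases and subjects 5-8 are controls.
subject_ids = [s for s in range(1, 9) for _ in range(2)]
labels = ["case" if s <= 4 else "control" for s in subject_ids]

rng = random.Random(0)
permuted = subject_wise_permutation(subject_ids, labels, rng)

# Record-wise split (assumed layout): first record of each subject goes to
# the training set, second record to the test set.
train_labels = [permuted[i] for i in range(0, 16, 2)]
test_labels = [permuted[i] for i in range(1, 16, 2)]
```

Because the shuffle happens per subject, `train_labels` and `test_labels` are identical lists, which is the property that lets a classifier still exploit subject identity after the disease signal has been destroyed.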
Fig. 2: Identity confounding in digital health data sets. In all panels, the permutation null distribution is represented by the blue histogram while the observed AUC value is represented by the brown line. In all panels the permutation null distribution is shifted away from 0.5, the baseline random-guess value for the AUC metric. a, b show the results for the UCI Parkinson’s and UCI Parkinson’s MSRD data sets, respectively. Note the larger spread of the permutation null distributions (compared to the remaining panels). Panel c shows the results for the mPower voice data based on 22 subjects. Note that the observed AUC falls right in the middle of the permutation null distribution. Panel d shows the results for the tapping task based on 22 subjects. In this case, however, the observed value falls in the tail of the null distribution. c, e, and g compare the results for the mPower voice data across increasing numbers of subjects (namely, 22, 42, and 240 subjects). d, f, h show the analogous comparison for the mPower tapping data (based on 22, 48, and 290 subjects, respectively). The results were generated using the random forest classifier, and were based on 1000 permutations. See the Methods section for a description of the permutation scheme used to generate the permutation null distributions.
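To see why such a null distribution can sit far from 0.5, the following toy sketch builds one on synthetic data. It is an assumption-laden illustration rather than a reproduction of the study: it swaps the random forest for a simple one-nearest-neighbor rule, uses 200 permutations instead of 1000, and generates a single made-up feature that encodes subject identity only. The helper names `auc` and `nearest_neighbor_scores` are hypothetical.

```python
import random

def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly;
    ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def nearest_neighbor_scores(train_x, train_y, test_x):
    """Score each test record with the label of its closest training record."""
    scores = []
    for x in test_x:
        j = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
        scores.append(train_y[j])
    return scores

rng = random.Random(1)
n_subjects = 20

# Synthetic feature: a subject-specific value plus small noise, so records
# from the same subject are always closer to each other than to anyone else.
# Record-wise split: one record per subject in train, one in test.
train_x = [s + rng.uniform(-0.2, 0.2) for s in range(n_subjects)]
test_x = [s + rng.uniform(-0.2, 0.2) for s in range(n_subjects)]
true_labels = [s % 2 for s in range(n_subjects)]  # 10 "cases", 10 "controls"

null_aucs = []
for _ in range(200):
    # Subject-wise permutation: one shuffled label per subject, applied to
    # that subject's training and test records alike.
    shuffled = true_labels[:]
    rng.shuffle(shuffled)
    scores = nearest_neighbor_scores(train_x, shuffled, test_x)
    null_aucs.append(auc(scores, shuffled))

mean_null_auc = sum(null_aucs) / len(null_aucs)
```

In this deliberately extreme setup the nearest training record always belongs to the same subject, so every permuted AUC equals 1.0 even though the shuffled labels carry no disease information. Real data are noisier, but the mechanism is the same one that shifts the null distributions in the figure away from 0.5.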