Christopher W Bartlett, Brett G Klamer, Steven Buyske, Stephen A Petrill, William C Ray.
Abstract
Informatics researchers often need to combine data from many different sources to increase statistical power and to study subtle or complicated effects. Perfect overlap of measurements across academic studies is rare, since virtually every dataset is collected for a unique purpose and without coordination with parties not at hand (i.e., informatics researchers in the future). Thus, incomplete concordance of measurements across datasets poses a major challenge for researchers seeking to combine public databases. In any given field, some measurements are fairly standard, but every organization collecting data makes unique decisions on instruments, protocols, and methods of processing the data. This typically precludes literal concatenation of the raw data, since the constituent cohorts do not have the same measurements (i.e., columns of data). When measurements across datasets are similar prima facie, there is a desire to combine the data to increase power, but mixing non-identical measurements could greatly reduce the sensitivity of the downstream analysis. Here, we discuss a statistical method that is applicable when certain patterns of missing data are found; namely, it is possible to combine datasets that measure the same underlying constructs (or latent traits) when there is only partial overlap of measurements across the constituent datasets. Our method, ROSETTA, empirically derives a set of common latent trait metrics for each related measurement domain using a novel variation of factor analysis to ensure equivalence across the constituent datasets. The advantage of combining datasets this way is the simplicity, statistical power, and modeling flexibility of a single joint analysis of all the data. Three simulation studies show the performance of ROSETTA on datasets with only partially overlapping measurements (i.e., systematically missing information), benchmarked against a condition of perfectly overlapping data (i.e., full information).
The first study examined a range of correlations, while the second study was modeled after the observed correlations in a well-characterized clinical, behavioral cohort. Both studies consistently show significant correlations >0.94, often >0.96, between the ROSETTA trait scores and the corresponding full-data trait scores, indicating the robustness of the method and validating the general approach. The third study varied within-domain and between-domain correlations and compared ROSETTA to multiple imputation and meta-analysis, two commonly used methods that ostensibly solve the same data integration problem. We provide an alternative to meta-analysis and multiple imputation by developing a method that statistically equates similar but distinct manifest metrics into a set of empirically derived metrics that can be used for analysis across all datasets.
Keywords: data blending; data integration; data pools; databases; informatics
Year: 2019 PMID: 31546899 PMCID: PMC6771148 DOI: 10.3390/genes10090727
Source DB: PubMed Journal: Genes (Basel) ISSN: 2073-4425 Impact factor: 4.096
Figure 1. The ROSETTA flow of information. The method consists of three steps to get from the constituent datasets that are the input to ROSETTA in this example to the final harmonized single output dataset. (A) The constituent datasets are concatenated, and missing data are shown with an “X”. We assume the datasets are independent in terms of the rows; in biology, these are typically the different subjects that were measured. The columns are the nine measurements (V1–V9) that occurred across the three constituent datasets illustrated here. Importantly, no measurement was common to all three datasets, which prevents a simple joint analysis on even one common measure, but the lack of complete overlap for any single measurement does not preclude using the ROSETTA pipeline. We also applied color to show that the nine measurements come from three unique measurement domains. In the simulation study, we manipulate the strength of the correlations between the domains, but we expect that ROSETTA is most useful when the domains have non-zero correlations. (B) The first step is to construct the pairwise correlation matrix for all measures across all datasets. In panel B, we show that logical intersections of the datasets allow each domain to have a complete pairwise correlation matrix despite the pattern of missing data. (C) The same logic from panel B can be extended to show that the entire 9x9 matrix can be successfully estimated. The second step is to construct the geometry for the factor analysis using the 9x9 correlation matrix from step one (in panel C). The factor analysis provides a set of linear weights for combining the measurements into factor scores. (D) Importantly, the correlation between the factors will be used in the next step. (E) The third step is to apply the factor loadings and the correlations between the factors from the second step as a constraint for each constituent dataset (using the math from confirmatory factor analysis).
Factor loadings are set equal to zero when a measure is not present in a given dataset; the constraint on the correlations between the factors then ensures equivalence of the factors between the datasets. While this third step is similar to the hypothesis testing of confirmatory factor analysis, where a model from one dataset is applied to a novel dataset, in ROSETTA the model was derived over all datasets, and that same model is used as a constraint when applied to each constituent dataset (i.e., ROSETTA is not a hypothesis testing procedure). (F) The final result is a complete dataset of the domain factor scores for analysis. Rather than outputting nine variables (such as would occur with multiple imputation), ROSETTA outputs three domain factor scores per subject (labeled D1–D3).
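The first step described in the caption (panels A to C) can be illustrated with a minimal sketch. This is not the ROSETTA implementation: the simulated loadings, trait correlations, sample sizes, and the particular pattern of dropped columns are all hypothetical choices made for illustration. The sketch only shows that when every pair of measures is co-observed in at least one dataset, a full 9x9 pairwise-complete correlation matrix can be estimated from the stacked data even though no column is observed in all three datasets.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
cols = [f"V{i}" for i in range(1, 10)]  # V1-V3, V4-V6, V7-V9 form three domains

def simulate(n):
    """Simulate nine measures driven by three correlated latent traits.

    The trait correlations (0.3), loadings (0.8), and noise scale (0.6)
    are hypothetical values, not taken from the paper.
    """
    trait_corr = np.array([[1.0, 0.3, 0.3],
                           [0.3, 1.0, 0.3],
                           [0.3, 0.3, 1.0]])
    traits = rng.multivariate_normal(np.zeros(3), trait_corr, size=n)
    loadings = np.kron(np.eye(3), np.full((1, 3), 0.8))  # shape (3, 9)
    noise = rng.normal(scale=0.6, size=(n, 9))
    return pd.DataFrame(traits @ loadings + noise, columns=cols)

# Hypothetical missingness: each dataset lacks one measure per domain, so
# every measure appears in exactly two of the three datasets.
dropped = {
    "A": ["V1", "V4", "V7"],
    "B": ["V2", "V5", "V8"],
    "C": ["V3", "V6", "V9"],
}
parts = [simulate(200).drop(columns=d) for d in dropped.values()]

# Concatenate rows; columns absent from a dataset become NaN (panel A).
stacked = pd.concat(parts, ignore_index=True)[cols]

# DataFrame.corr() uses pairwise-complete observations, so each entry is
# estimated from the rows where both measures were recorded (panels B and C).
corr = stacked.corr()

print(stacked.notna().all(axis=0).sum())  # number of fully observed columns
print(corr.round(2))
```

One caveat of this kind of sketch: a correlation matrix assembled from pairwise-complete observations is not guaranteed to be positive semidefinite, so a real pipeline would likely need to check or repair it before the factor analysis step.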
Parameterized cross-correlation structure of the latent traits for simulation.
Figure 2. Study 1: Comparison of ROSETTA trait scores on incomplete data versus latent trait scores on complete data. We compared values of the latent traits from the full dataset condition (ground truth) on the x-axis to the incomplete matched datasets derived with ROSETTA on the y-axis.
Study 1: Observed Correlations between ROSETTA and Full Data.
| | Trait 1 – Full Data | Trait 2 – Full Data | Trait 3 – Full Data |
|---|---|---|---|
| Trait 1 - ROSETTA | 0.942 | 0.247 | 0.435 |
| Trait 2 - ROSETTA | 0.262 | 0.993 | 0.409 |
| Trait 3 - ROSETTA | 0.470 | 0.410 | 0.980 |
Study 1: Expected Correlations Between Measures as Determined by Full Data.
| | Trait 1 – Full Data | Trait 2 – Full Data | Trait 3 – Full Data |
|---|---|---|---|
| Trait 1 – Full Data | 1.000 | 0.252 | 0.451 |
| Trait 2 – Full Data | | 1.000 | 0.402 |
| Trait 3 – Full Data | | | 1.000 |
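As a quick arithmetic check on the two Study 1 tables, the sketch below compares the observed ROSETTA versus full-data correlations with the expected full-data correlations. The numbers are copied from the tables; the variable names and the choice of summary statistics (largest off-diagonal deviation, smallest diagonal value) are illustrative assumptions, not metrics reported by the authors.

```python
import numpy as np

# Observed: rows are ROSETTA traits, columns are full-data traits (Study 1).
observed = np.array([
    [0.942, 0.247, 0.435],
    [0.262, 0.993, 0.409],
    [0.470, 0.410, 0.980],
])

# Expected: upper triangle as reported; NaN marks cells left blank in the table.
expected_upper = np.array([
    [1.000, 0.252, 0.451],
    [np.nan, 1.000, 0.402],
    [np.nan, np.nan, 1.000],
])

# Fill the lower triangle by symmetry.
expected = np.where(np.isnan(expected_upper), expected_upper.T, expected_upper)

# Off-diagonal cells: how far the cross-trait correlations drift from expectation.
off_diagonal = ~np.eye(3, dtype=bool)
max_cross_deviation = np.abs(observed - expected)[off_diagonal].max()

# Diagonal cells: how well each ROSETTA trait recovers its full-data counterpart.
min_recovery = np.diag(observed).min()

print(round(float(max_cross_deviation), 3))  # 0.019
print(float(min_recovery))                   # 0.942
```

Under these summaries, the cross-trait correlations in Study 1 stay within about 0.02 of their expected values, and the weakest same-trait recovery is 0.942.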
Figure 3. Study 2: Data Modeled After a Clinical Behavioral Dataset, Comparison of ROSETTA scores versus scores from complete data. We compared values of the latent traits from the full dataset condition (ground truth) on the x-axis to the incomplete matched datasets derived with ROSETTA on the y-axis.
Study 2: Observed Correlations between ROSETTA and Full Data.
| | Trait 1 – Full Data | Trait 2 – Full Data | Trait 3 – Full Data |
|---|---|---|---|
| Trait 1 - ROSETTA | 0.964 | 0.959 | 0.738 |
| Trait 2 - ROSETTA | 0.960 | 0.963 | 0.780 |
| Trait 3 - ROSETTA | 0.731 | 0.774 | 0.957 |
Study 2: Expected Correlations Between Measures as Determined by Full Data.
| | Trait 1 – Full Data | Trait 2 – Full Data | Trait 3 – Full Data |
|---|---|---|---|
| Trait 1 – Full Data | 1.000 | 0.952 | 0.741 |
| Trait 2 – Full Data | | 1.000 | 0.763 |
| Trait 3 – Full Data | | | 1.000 |
Figure 4. Study 3: Average –log(p-value) on the y-axis for Each Method by Within-Domain Correlation (panels) and Between-Domain Correlation (x-axis). Provided the within-domain correlation is >0.3 on average, ROSETTA shows a clear advantage for downstream analysis. When the within-domain correlation is 0.2 or less, the current implementation of ROSETTA runs into numerical issues and can no longer be applied.
Study 3: Power of Each Method by Within-Domain and Between-Domain Correlation.
| Within | Between | Rosetta-c | Truth-c | Imputation-c | Meta-c |
|---|---|---|---|---|---|
| 0.9 | 0.4 | 1 | 1 | 1 | 1 |
| 0.9 | 0.3 | 1 | 1 | 1 | 1 |
| 0.9 | 0.2 | 1 | 1 | 1 | 1 |
| 0.9 | 0.1 | 0.92 | 1 | 0.92 | 0.76 |
| 0.75 | 0.4 | 1 | 1 | 1 | 1 |
| 0.75 | 0.3 | 1 | 1 | 1 | 1 |
| 0.75 | 0.2 | 1 | 0.88 | 0.88 | 0.92 |
| 0.75 | 0.1 | 0.96 | 0.8 | 0.48 | 0.64 |
| 0.6 | 0.4 | 1 | 0.96 | 0.96 | 0.92 |
| 0.6 | 0.3 | 1 | 0.88 | 0.84 | 0.84 |
| 0.6 | 0.2 | 1 | 0.76 | 0.68 | 0.88 |
| 0.6 | 0.1 | 0.96 | 0.6 | 0.56 | 0.64 |
| 0.45 | 0.4 | 1 | 0.8 | 0.72 | 0.76 |
| 0.45 | 0.3 | 1 | 0.68 | 0.6 | 0.76 |
| 0.45 | 0.2 | 1 | 0.68 | 0.52 | 0.6 |
| 0.45 | 0.1 | 0.96 | 0.4 | 0.28 | 0.52 |
| 0.3 | 0.4 | 1 | 0.64 | 0.6 | 0.44 |
| 0.3 | 0.3 | 1 | 0.48 | 0.36 | 0.44 |
| 0.3 | 0.2 | 0.96 | 0.36 | 0.36 | 0.48 |
| 0.3 | 0.1 | 0.8 | 0.36 | 0.48 | 0.48 |
| 0.2 | 0.4 | 0.32 | 0.4 | 0.24 | 0.32 |
| 0.2 | 0.3 | 0.17 | 0.28 | 0.48 | 0.48 |
| 0.2 | 0.2 | 0 | 0.24 | 0.52 | 0.4 |
| 0.2 | 0.1 | 0 | 0.24 | 0.48 | 0.44 |
| 0.1 | 0.4 | 0 | 0.28 | 0.72 | 0.4 |
| 0.1 | 0.3 | 0 | 0.24 | 0.2 | 0.36 |
| 0.1 | 0.2 | 0 | 0.2 | 0.4 | 0.36 |
| 0.1 | 0.1 | 0 | 0.2 | 0.6 | 0.36 |
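For readers unfamiliar with how quantities like those in Figure 4 and the power table are typically obtained, here is a minimal sketch of a simulation-based power estimate. It assumes that power means the fraction of simulation replicates whose test rejects at a fixed threshold; the excerpt does not state the threshold, the number of replicates, the downstream test, or the logarithm base, so `alpha=0.05`, `n_reps=25`, `n_subjects=100`, the Pearson correlation test, and base-10 logs are all hypothetical choices. The function `empirical_power` is an illustrative helper, not part of ROSETTA.

```python
import numpy as np
from scipy import stats

def empirical_power(effect_r, n_subjects=100, n_reps=25, alpha=0.05, seed=0):
    """Estimate power and mean -log10(p) for detecting a correlation.

    Each replicate simulates a predictor and an outcome with population
    correlation `effect_r`, then tests the sample correlation.
    """
    rng = np.random.default_rng(seed)
    p_values = np.empty(n_reps)
    for rep in range(n_reps):
        x = rng.normal(size=n_subjects)
        noise = rng.normal(size=n_subjects)
        y = effect_r * x + np.sqrt(1.0 - effect_r**2) * noise
        p_values[rep] = stats.pearsonr(x, y)[1]  # second element is the p-value
    power = float(np.mean(p_values < alpha))              # share of rejections
    mean_neg_log_p = float(np.mean(-np.log10(p_values)))  # Figure 4 style summary
    return power, mean_neg_log_p

power_strong, neglog_strong = empirical_power(0.4)
power_weak, neglog_weak = empirical_power(0.1)
print(power_strong, power_weak)
```

With a small number of replicates, the estimated power can only take a coarse grid of values, which is worth keeping in mind when comparing neighboring cells of a table like the one above.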