Yi-Sheng Chao, Chao-Jung Wu.
Abstract
The production of indices composed of multiple input variables is embedded in some data processing and analytical methods. We aim to test the feasibility of creating data-driven indices by aggregating input variables according to principal component analysis (PCA) loadings. To validate the significance of both theory-based and data-driven indices, we propose principles for reviewing innovative indices. We generated weighted indices from the variables obtained in the first years of the two-year panels of the Medical Expenditure Panel Survey initiated between 1996 and 2011. Variables were weighted according to PCA loadings and summed. The statistical significance and residual deviance of each index in predicting mortality in the second years were extracted from the results of discrete-time survival analyses. There were 237,832 interviewees who survived the first years of the panels, representing 4.5 billion civilians in the United States, of whom 0.62% (95% CI = 0.58% to 0.66%) died in the second years of the panels. Of all 134,689 weighted indices, 40,803 significantly predicted mortality in the second years both with and without adjustment for age, sex and race. The indices significant in both models could sustain at most 10,200 years of academic tenure for an individual researcher publishing four indices per year, or 618.2 years of publishing for a journal with an annual volume of 66 articles. In conclusion, aggregating information according to PCA loadings can yield a large number of significant innovative indices composed of input variables of varying predictive power. To justify such large quantities of innovative indices, we propose a reporting and review framework for novel indices based on the objectives for creating the indices, variable weighting, related outcomes and database characteristics. The indices selected by this framework could lead to a new genre of publications focusing on meaningful aggregation of information.
Year: 2017 PMID: 28886057 PMCID: PMC5590867 DOI: 10.1371/journal.pone.0183997
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Fig 1. A flow chart of data linkage, data processing, feature selection, principal component analysis and index generation with the Medical Expenditure Panel Survey (MEPS) 1996 to 2012.
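The index-generation step (weight each variable by its PCA loading, then sum) can be illustrated with a minimal sketch. It is not the authors' code: the toy data, the function names, the use of standardized variables and the power-iteration shortcut for the first component are all assumptions made for illustration.

```python
import math

def standardize(rows):
    """Center each column and scale it to unit sample variance."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
           for j in range(p)]
    return [[(r[j] - means[j]) / sds[j] for j in range(p)] for r in rows]

def correlation_matrix(z):
    """Correlation matrix of already standardized data."""
    n, p = len(z), len(z[0])
    return [[sum(r[a] * r[b] for r in z) / (n - 1) for b in range(p)]
            for a in range(p)]

def first_loadings(matrix, iters=500):
    """Power iteration: unit-length leading eigenvector (PC1 loadings).

    Assumes the start vector is not orthogonal to the leading eigenvector,
    which holds for the positively correlated toy data below.
    """
    p = len(matrix)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iters):
        w = [sum(matrix[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

def weighted_index(z, loadings):
    """One index value per person: sum of loading * standardized variable."""
    return [sum(l * x for l, x in zip(loadings, row)) for row in z]

# Hypothetical toy data: 5 people, 3 positively correlated input variables.
toy = [[1.0, 2.1, 0.5], [2.0, 3.9, 1.0], [3.0, 6.2, 1.4],
       [4.0, 8.1, 2.1], [5.0, 9.8, 2.4]]
z = standardize(toy)
loadings = first_loadings(correlation_matrix(z))
index = weighted_index(z, loadings)
index_variance = sum(x * x for x in index) / (len(index) - 1)
```

In this sketch every principal component would give one candidate index. The count of 134,689 indices reported here presumably arises from repeating the step over many components and variable subsets, a detail this excerpt does not spell out.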
Proposed publication cycle for weighted indices.
| Stages in publication cycles | Preparation | 1st month | 2nd month | 3rd month |
|---|---|---|---|---|
| | Search for all significant indices | Generate theories or hypotheses to introduce new indices | Publish PCA-based indices | Validate the published index |
| | Select a database and a target outcome; generate PCA-based indices; test significance regarding one particular outcome; summarize the number of significant indices | Select a significant index consciously or randomly; create index names and attach new theories or hypotheses; publish new theories | Use the statistically significant weighted index to support new theories or hypotheses | Emphasize the importance and significance of the index by demonstrating its significant role in other outcomes, other data sources, other subpopulations, other contexts and so on |

Note: repeat the cycle until running out of indices for desired outcomes.
The characteristics of the interviewees in the first to 16th Medical Expenditure Panel Survey.
| Panel begin year | Unweighted sample size (n) | Weighted sample size (n) | (95% CI) | Female (%) | (95% CI) | White (%) | (95% CI) | Black (%) | (95% CI) | American Indian/Alaska Native (%) | (95% CI) | Asian (%) | (95% CI) | Native Hawaiian/Pacific Islander (%) | (95% CI) | Multiple races (%) | (95% CI) | Died in the 2nd year of panel (%) | (95% CI) | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1996 | 18,847 | 260,676,916 | (243,877,393 to 277,476,439) | 51.29% | (50.59% to 52.00%) | 81.77% | (80.12% to 83.43%) | 13.09% | (11.65% to 14.53%) | 1.30% | (0.86% to 1.75%) | 3.76% | (2.99% to 4.52%) | 0.00% | (0.00% to 0.00%) | 0.08% | (0.02% to 0.13%) | 0.14% | (0.08% to 0.21%) | |
| 1997 | 11,917 | 266,865,458 | (238,549,994 to 295,180,923) | 51.21% | (50.39% to 52.03%) | 82.76% | (80.68% to 84.84%) | 13.04% | (11.06% to 15.02%) | 0.91% | (0.59% to 1.22%) | 3.29% | (2.38% to 4.20%) | 0.00% | (0.00% to 0.00%) | 0.00% | (0.00% to 0.00%) | 0.66% | (0.49% to 0.82%) | |
| 1998 | 9,704 | 268,961,612 | (235,532,351 to 302,390,873) | 51.25% | (50.29% to 52.21%) | 81.41% | (79.07% to 83.75%) | 13.15% | (11.02% to 15.28%) | 0.59% | (0.35% to 0.83%) | 4.85% | (3.64% to 6.07%) | 0.00% | (0.00% to 0.00%) | 0.00% | (0.00% to 0.00%) | 0.69% | (0.50% to 0.88%) | |
| 1999 | 12,833 | 271,359,560 | (223,667,123 to 319,051,996) | 51.19% | (50.43% to 51.96%) | 82.34% | (79.88% to 84.80%) | 13.20% | (10.63% to 15.78%) | 1.11% | (0.52% to 1.70%) | 3.34% | (2.58% to 4.10%) | 0.00% | (0.00% to 0.00%) | 0.00% | (0.00% to 0.00%) | 0.65% | (0.45% to 0.86%) | |
| 2000 | 10,000 | 275,825,314 | (228,649,135 to 323,001,494) | 51.28% | (50.30% to 52.27%) | 83.05% | (80.68% to 85.41%) | 12.95% | (10.61% to 15.29%) | 0.55% | (0.30% to 0.79%) | 3.46% | (2.62% to 4.30%) | 0.00% | (0.00% to 0.00%) | 0.00% | (0.00% to 0.00%) | 0.68% | (0.50% to 0.87%) | |
| 2001 | 20,328 | 280,432,464 | (249,937,923 to 310,927,005) | 51.17% | (50.54% to 51.80%) | 81.13% | (79.48% to 82.79%) | 12.27% | (10.64% to 13.90%) | 1.02% | (0.67% to 1.37%) | 3.83% | (3.05% to 4.61%) | 0.35% | (0.11% to 0.59%) | 1.40% | (1.10% to 1.69%) | 0.65% | (0.53% to 0.77%) | |
| 2002 | 15,513 | 282,724,249 | (253,127,157 to 312,321,342) | 51.10% | (50.34% to 51.87%) | 81.48% | (79.90% to 83.07%) | 12.33% | (10.85% to 13.81%) | 0.80% | (0.53% to 1.07%) | 3.86% | (3.10% to 4.63%) | 0.27% | (0.13% to 0.40%) | 1.26% | (0.99% to 1.53%) | 0.66% | (0.50% to 0.82%) | |
| 2003 | 15,549 | 285,244,087 | (254,060,021 to 316,428,152) | 51.08% | (50.45% to 51.70%) | 81.01% | (79.26% to 82.75%) | 12.36% | (10.70% to 14.01%) | 0.64% | (0.36% to 0.93%) | 3.92% | (3.18% to 4.67%) | 0.33% | (0.15% to 0.52%) | 1.73% | (1.32% to 2.15%) | 0.66% | (0.50% to 0.81%) | |
| 2004 | 15,398 | 287,469,219 | (262,449,744 to 312,488,694) | 51.07% | (50.31% to 51.82%) | 80.20% | (78.25% to 82.15%) | 12.49% | (10.75% to 14.23%) | 0.83% | (0.49% to 1.17%) | 4.26% | (3.40% to 5.13%) | 0.38% | (0.17% to 0.59%) | 1.84% | (1.48% to 2.20%) | 0.62% | (0.47% to 0.77%) | |
| 2005 | 14,961 | 290,237,146 | (264,335,686 to 316,138,607) | 51.08% | (50.32% to 51.83%) | 80.38% | (78.49% to 82.28%) | 12.41% | (10.80% to 14.01%) | 0.78% | (0.41% to 1.15%) | 4.33% | (3.47% to 5.19%) | 0.35% | (0.08% to 0.62%) | 1.75% | (1.37% to 2.14%) | 0.63% | (0.46% to 0.80%) | |
| 2006 | 15,871 | 292,567,761 | (270,395,864 to 314,739,659) | 51.09% | (50.42% to 51.76%) | 80.14% | (78.44% to 81.85%) | 12.38% | (10.99% to 13.78%) | 0.91% | (0.52% to 1.30%) | 4.37% | (3.52% to 5.22%) | 0.45% | (0.19% to 0.71%) | 1.74% | (1.37% to 2.10%) | 0.62% | (0.48% to 0.76%) | |
| 2007 | 11,965 | 295,618,849 | (275,773,974 to 315,463,724) | 50.97% | (50.17% to 51.76%) | 80.43% | (78.37% to 82.50%) | 12.35% | (10.56% to 14.13%) | 0.79% | (0.45% to 1.12%) | 4.30% | (3.46% to 5.13%) | 0.21% | (0.11% to 0.31%) | 1.93% | (1.47% to 2.39%) | 0.62% | (0.44% to 0.80%) | |
| 2008 | 17,510 | 297,983,418 | (281,865,082 to 314,101,755) | 50.96% | (50.22% to 51.70%) | 79.93% | (78.15% to 81.71%) | 12.37% | (10.98% to 13.76%) | 0.78% | (0.42% to 1.14%) | 4.57% | (3.74% to 5.41%) | 0.25% | (0.11% to 0.40%) | 2.09% | (1.67% to 2.51%) | 0.65% | (0.48% to 0.82%) | |
| 2009 | 15,642 | 300,419,079 | (282,428,580 to 318,409,578) | 50.94% | (50.26% to 51.63%) | 79.86% | (78.09% to 81.62%) | 12.54% | (10.97% to 14.11%) | 0.90% | (0.45% to 1.35%) | 4.67% | (3.79% to 5.55%) | 0.29% | (0.15% to 0.44%) | 1.74% | (1.37% to 2.11%) | 0.67% | (0.48% to 0.86%) | |
| 2010 | 13,977 | 303,027,202 | (286,337,440 to 319,716,964) | 51.21% | (50.42% to 51.99%) | 79.59% | (77.64% to 81.55%) | 12.32% | (10.76% to 13.87%) | 0.81% | (0.30% to 1.32%) | 4.97% | (3.96% to 5.99%) | 0.61% | (0.34% to 0.88%) | 1.70% | (1.35% to 2.05%) | 0.63% | (0.46% to 0.81%) | |
| 2011 | 17,817 | 305,055,352 | (287,349,122 to 322,761,583) | 51.16% | (50.46% to 51.86%) | 79.91% | (78.01% to 81.82%) | 12.39% | (11.01% to 13.77%) | 0.00% | (0.00% to 0.00%) | 5.16% | (4.00% to 6.32%) | 0.00% | (0.00% to 0.00%) | 2.53% | (2.11% to 2.95%) | 0.64% | (0.49% to 0.79%) | |
| All panels | 237,832 | 4,564,467,688 | (4,394,590,562 to 4,734,344,813) | 51.12% | (50.92% to 51.33%) | 80.92% | (80.17% to 81.68%) | 12.59% | (11.90% to 13.28%) | 0.79% | (0.66% to 0.92%) | 4.20% | (3.87% to 4.54%) | 0.22% | (0.17% to 0.27%) | 1.27% | (1.18% to 1.37%) | 0.62% | (0.58% to 0.66%) | |
Note: the proportions by sex and white race do not differ significantly across panels (p = 1 and 0.24 respectively). The proportions dying in the second years of the MEPS panels differ across panels (p < 0.01).
Fig 2. The Kaplan-Meier survival curves of the interviewees in the second years of the MEPS panels.
(a) The Kaplan-Meier survival curves by sex. Chi-square = 12.23, p < 0.001. (b) The Kaplan-Meier survival curves by races. Chi-square = 27.46, p < 0.001.
Fig 3. The distributions of those dying in the second years of the MEPS panels by principal components.
(a) The distribution of those dying by first and second principal components (PC1 and PC2). (b) The distribution of those dying by first and third principal components (PC1 and PC3). (c) The distribution of those dying by first and fourth principal components (PC1 and PC4). (d) The distribution of those dying by first and fifth principal components (PC1 and PC5). Note: red circles: those dying in the second years of the MEPS panels; gray circles: those surviving throughout the MEPS panels.
Coefficients of the first principal component to predict mortality in the second years of the MEPS panels.
| PC1 coef | (95% CI) | p | PC2 coef | (95% CI) | p | PC3 coef | (95% CI) | p | |
|---|---|---|---|---|---|---|---|---|---|
| -0.030 | (-0.234 to 0.175) | 0.78 | -0.309 | (-0.329 to -0.289) | 0.78 | -0.242 | (-0.260 to -0.223) | <0.001 | |
| -6.287 | (-8.722 to -3.853) | <0.001 | -7.476 | (-7.670 to -7.282) | <0.001 | -6.906 | (-7.062 to -6.750) | <0.001 | |
| -6.512 | (-9.243 to -3.782) | <0.001 | -7.694 | (-7.949 to -7.440) | <0.001 | -7.227 | (-7.418 to -7.037) | <0.001 | |
| -6.399 | (-8.985 to -3.812) | <0.001 | -7.279 | (-7.478 to -7.080) | <0.001 | -6.989 | (-7.174 to -6.803) | <0.001 | |
| -6.459 | (-9.133 to -3.785) | <0.001 | -7.363 | (-7.558 to -7.168) | <0.001 | -6.964 | (-7.142 to -6.786) | <0.001 | |
| 0.005 | (-0.310 to 0.319) | 0.98 | -0.002 | (-0.033 to 0.030) | 0.91 | -0.019 | (-0.047 to 0.009) | 0.18 | |
| 0.007 | (-0.301 to 0.315) | 0.97 | 0.037 | (0.004 to 0.071) | 0.03 | 0.003 | (-0.029 to 0.035) | 0.87 | |
| 0.008 | (-0.307 to 0.324) | 0.96 | 0.031 | (0.001 to 0.062) | 0.04 | 0.019 | (-0.011 to 0.050) | 0.21 | |
Coefficients of the first principal component and demographics to predict mortality in the second years of the MEPS panels.
| PC1 coef | (95% CI) | p | PC2 coef | (95% CI) | p | PC3 coef | (95% CI) | p | |
|---|---|---|---|---|---|---|---|---|---|
| -0.030 | (-0.038 to -0.021) | <0.001 | -0.219 | (-0.243 to -0.194) | <0.001 | -0.100 | (-0.128 to -0.072) | <0.001 | |
| -10.136 | (-10.521 to -9.751) | <0.001 | -8.983 | (-9.369 to -8.597) | <0.001 | -10.027 | (-10.446 to -9.608) | <0.001 | |
| -10.351 | (-10.743 to -9.959) | <0.001 | -9.201 | (-9.625 to -8.777) | <0.001 | -10.391 | (-10.865 to -9.918) | <0.001 | |
| -10.227 | (-10.616 to -9.838) | <0.001 | -8.730 | (-9.101 to -8.358) | <0.001 | -10.085 | (-10.521 to -9.648) | <0.001 | |
| -10.278 | (-10.660 to -9.896) | <0.001 | -8.819 | (-9.172 to -8.465) | <0.001 | -10.013 | (-10.441 to -9.585) | <0.001 | |
| 0.082 | (0.078 to 0.087) | <0.001 | 0.050 | (0.045 to 0.054) | <0.001 | 0.073 | (0.067 to 0.078) | <0.001 | |
| -0.496 | (-0.619 to -0.374) | <0.001 | -0.628 | (-0.753 to -0.502) | <0.001 | -0.475 | (-0.597 to -0.352) | <0.001 | |
| 0.318 | (0.150 to 0.486) | <0.001 | 0.396 | (0.236 to 0.556) | <0.001 | 0.598 | (0.437 to 0.759) | <0.001 | |
| -0.472 | (-1.315 to 0.372) | 0.27 | -0.655 | (-1.523 to 0.212) | 0.14 | -0.332 | (-1.170 to 0.507) | 0.44 | |
| -0.360 | (-0.830 to 0.111) | 0.13 | -0.253 | (-0.735 to 0.229) | 0.30 | -0.096 | (-0.564 to 0.371) | 0.69 | |
| 0.671 | (-0.642 to 1.983) | 0.32 | 0.719 | (-0.581 to 2.018) | 0.28 | 0.895 | (-0.474 to 2.264) | 0.20 | |
| 0.692 | (0.167 to 1.217) | 0.01 | 0.507 | (-0.031 to 1.045) | 0.06 | 0.789 | (0.261 to 1.318) | <0.01 | |
| 0.006 | (-0.009 to 0.021) | 0.43 | -0.003 | (-0.041 to 0.035) | 0.89 | -0.028 | (-0.070 to 0.013) | 0.18 | |
| 0.009 | (-0.005 to 0.023) | 0.21 | 0.044 | (0.005 to 0.082) | 0.03 | 0.005 | (-0.043 to 0.052) | 0.84 | |
| 0.011 | (-0.003 to 0.025) | 0.12 | 0.036 | (0.000 to 0.072) | 0.05 | 0.030 | (-0.015 to 0.076) | 0.19 | |
Fig 4. The p values of all PCA-based weighted indices regarding the prediction of mortality risk.
(a) P values for 134,689 PCA-based indices regarding mortality risk in models that include time (in quarters) and interactions between indices and time. (b) P values for 134,689 PCA-based indices regarding mortality risk in models that include age, sex, race, time (in quarters) and interactions between indices and time.
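Discrete-time survival models of this kind are commonly fitted as logistic regressions on person-period data, with one row per person per quarter at risk. The sketch below shows only that data layout and the raw per-quarter hazards. The subjects, field names and four-quarter follow-up are hypothetical, and the excerpt does not state the software or exact specification the authors used.

```python
def to_person_period(subjects, n_quarters=4):
    """Expand one record per person into one record per quarter at risk."""
    rows = []
    for s in subjects:
        # People who die contribute rows up to and including the death quarter;
        # survivors contribute a row for every quarter of follow-up.
        last = s["death_quarter"] if s["death_quarter"] is not None else n_quarters
        for q in range(1, last + 1):
            rows.append({
                "id": s["id"],
                "quarter": q,                        # time, in quarters
                "index": s["index"],                 # PCA-based weighted index
                "index_x_quarter": s["index"] * q,   # index-by-time interaction
                "died": int(s["death_quarter"] == q),
            })
    return rows

def hazard_by_quarter(rows):
    """Raw discrete-time hazard: deaths in a quarter / people at risk in it."""
    at_risk, deaths = {}, {}
    for r in rows:
        at_risk[r["quarter"]] = at_risk.get(r["quarter"], 0) + 1
        deaths[r["quarter"]] = deaths.get(r["quarter"], 0) + r["died"]
    return {q: deaths[q] / at_risk[q] for q in sorted(at_risk)}

# Hypothetical subjects: death_quarter is None for second-year survivors.
subjects = [
    {"id": 1, "index": -0.8, "death_quarter": None},
    {"id": 2, "index": 0.3, "death_quarter": 2},
    {"id": 3, "index": 1.5, "death_quarter": 1},
    {"id": 4, "index": -0.2, "death_quarter": None},
]
rows = to_person_period(subjects)
hazards = hazard_by_quarter(rows)
```

A logistic regression of `died` on `quarter`, `index` and `index_x_quarter` (plus age, sex and race for the adjusted models) would then give the kind of p values summarized in panels (a) and (b).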
Summaries of the significance (p<0.05) of all PCA-based weighted indices.
| Unadjusted and adjusted models: insignificant indices (n) | Unadjusted and adjusted models: significant indices (n) | Unadjusted and adjusted models: % of significant indices | Adjusted models: insignificant indices (n) | Adjusted models: significant indices (n) | Adjusted models: % of significant indices | Unadjusted models: insignificant indices (n) | Unadjusted models: significant indices (n) | Unadjusted models: % of significant indices | |
|---|---|---|---|---|---|---|---|---|---|
| 159 | 208 | 56.68% | 149 | 218 | 59.40% | 78 | 289 | 78.75% | |
| 5,482 | 5,161 | 48.49% | 4,863 | 5,780 | 54.31% | 3,115 | 7,528 | 70.73% | |
| 8,191 | 6,489 | 44.20% | 7,404 | 7,276 | 49.56% | 5,006 | 9,674 | 65.90% | |
| 79,696 | 28,945 | 26.64% | 76,574 | 32,067 | 29.52% | 65,790 | 43,209 | 39.64% | |
| 93,528 | 40,803 | 30.37% | 88,990 | 45,341 | 33.75% | 73,989 | 60,700 | 45.07% | |
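The percentages in the last row, and the publication-years figures quoted in the abstract, can be re-derived with simple arithmetic. The short check below uses only numbers copied from the table and the abstract; the variable and function names are illustrative.

```python
# Counts from the last row of the summary table (all indices).
both = {"insignificant": 93_528, "significant": 40_803}
adjusted = {"insignificant": 88_990, "significant": 45_341}
unadjusted = {"insignificant": 73_989, "significant": 60_700}

def pct_significant(counts):
    """Share of significant indices, in percent."""
    total = counts["insignificant"] + counts["significant"]
    return 100 * counts["significant"] / total

pct_both = pct_significant(both)              # table reports 30.37%
pct_adjusted = pct_significant(adjusted)      # table reports 33.75%
pct_unadjusted = pct_significant(unadjusted)  # table reports 45.07%

# Publication rates stated in the abstract.
INDICES_PER_RESEARCHER_YEAR = 4
ARTICLES_PER_JOURNAL_YEAR = 66

tenure_years = both["significant"] / INDICES_PER_RESEARCHER_YEAR
journal_years = both["significant"] / ARTICLES_PER_JOURNAL_YEAR

print(f"{pct_both:.2f}% {pct_adjusted:.2f}% {pct_unadjusted:.2f}%")
print(f"{tenure_years:.2f} researcher-years, {journal_years:.1f} journal-years")
```

The unadjusted column sums to 134,689 indices, whereas the other two column groups sum to 134,331. The table reports percentages against each group's own total, and the check follows the same convention.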
Fig 5. Flowchart of the process of index review and evaluation.
Note: PCA: principal component analysis; PLS: partial least squares.