Ma Yushuf Sharker, Mohammed Nasser, Jaynal Abedin, Benjamin F Arnold, Stephen P Luby.
Abstract
The asset index is often used as a measure of socioeconomic status in empirical research, either as an explanatory variable or to control confounding. Principal component analysis (PCA) is frequently used to create the asset index. We conducted a simulation study to explore how accurately the principal component-based asset index reflects the study subjects' actual poverty level when the actual poverty level is generated by a simple factor analytic model. In the simulation study using the PC-based asset index, only 1% to 4% of subjects preserved their real position in a quintile scale of assets; between 44% and 82% of subjects were misclassified into the wrong asset quintile. If the PC-based asset index explained less than 30% of the total variance in the component variables, then we consistently observed more than 50% misclassification across quintiles of the index. The frequency of misclassification suggests that the PC-based asset index may not provide a valid measure of poverty level and should be used cautiously as a measure of socioeconomic status.
Keywords: Asset index; Principal component analysis; Socio-economic status; Wealth index
Year: 2014 PMID: 24987446 PMCID: PMC4075602 DOI: 10.1186/1742-7622-11-6
Source DB: PubMed Journal: Emerg Themes Epidemiol ISSN: 1742-7622
Data generating process in the simulation
| ● | We generated artificial latent factor |
| ● | We considered normalized loading vectors |
| ● | We generated the data matrix |
| ● | We generated five dimensional random variables using the loading vectors and standard normal errors |
| ● | We performed PCA on |
Figure 1. Flowchart of the simulation.
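The steps listed above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the sample size of 100, the standard normal latent factor, the particular loading values, and the function name `simulate_asset_index` are all assumptions made for the example. It generates a latent factor, builds five indicators from a normalized loading vector plus standard normal errors, takes the first principal component as the asset index, and then compares rank positions and quintiles of the index against those of the latent factor.

```python
import numpy as np


def simulate_asset_index(n=100, loadings=(0.6, 0.5, 0.4, 0.4, 0.3), seed=0):
    """One simulation run (hypothetical helper; all defaults are illustrative)."""
    rng = np.random.default_rng(seed)

    # Latent "actual poverty level" per subject (assumed standard normal).
    factor = rng.standard_normal(n)

    # Normalize the loading vector to unit Euclidean length.
    lam = np.asarray(loadings, dtype=float)
    lam = lam / np.linalg.norm(lam)

    # Observed five-dimensional variables: factor times loadings plus
    # independent standard normal errors.
    X = np.outer(factor, lam) + rng.standard_normal((n, lam.size))

    # PCA via SVD of the column-centered data matrix.
    Xc = X - X.mean(axis=0)
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    pc1 = Vt[0]
    # The sign of a principal component is arbitrary; orient it to agree with
    # the known loadings (only possible because this is a simulation).
    if pc1 @ lam < 0:
        pc1 = -pc1
    index = Xc @ pc1
    explained = float(s[0] ** 2 / np.sum(s ** 2))

    # Rank positions (0 = lowest) and quintile membership (0..4).
    true_rank = factor.argsort().argsort()
    index_rank = index.argsort().argsort()
    true_quintile = true_rank * 5 // n
    index_quintile = index_rank * 5 // n

    return {
        "explained_variance": explained,
        "unchanged_positions": int(np.sum(true_rank == index_rank)),
        "misclassified": float(np.mean(true_quintile != index_quintile)),
    }
```

Calling `simulate_asset_index(seed=k)` for many values of `k`, or for several loading vectors, gives a distribution of explained variance, unchanged positions and quintile misclassification that can be summarized by minimum, quartiles, mean and maximum.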
Descriptive statistics of the number of unchanged order, and probability of misclassification into the wrong quintile for four different vectors in simulated data
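| Statistic | Number of unchanged order | | | | Dispersion of position | | | | Probability of misclassification | | | |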
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Minimum | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 5 | 0 | 0 | 0 | .04 |
| First quartile | 2 | 1 | 0 | 0 | 20 | 24 | 38 | 42 | .25 | .30 | .46 | .49 |
| Median | 4 | 3 | 1 | 1 | 37 | 44 | 93 | 76 | .44 | .50 | .82 | .67 |
| Mean | 7 | 6 | 4 | 2 | 38 | 50 | 68 | 69 | .41 | .51 | .65 | .66 |
| Third quartile | 7 | 6 | 4 | 3 | 52 | 92 | 97 | 97 | .55 | .82 | .89 | .87 |
| Maximum | 98 | 98 | 98 | 26 | 99 | 99 | 99 | 99 | .97 | .98 | .98 | .97 |
Figure 2. Scatter plots of frequency of unchanged position against probability of misclassification (A), proportion of explained variance against frequency of unchanged position (B), and proportion of explained variance against probability of misclassification (C). Red points are simulations in which the probability of misclassification was consistently above 80% regardless of the proportion of variance explained. Green points are simulations in which the probability of misclassification is negatively correlated with the proportion of variance explained.
Descriptive statistics of the number of unchanged order, and probability of misclassification into the wrong quintile for four different vectors for real expenditure data
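| Statistic | Number of unchanged order | | | | Dispersion of position | | | | Probability of misclassification | | | |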
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Minimum | 0 | 0 | 0 | 0 | 2 | 2 | 1 | 13 | 0 | 0 | 0 | .10 |
| First quartile | 1 | 1 | 0 | 0 | 56 | 63 | 73 | 71 | .55 | .59 | .67 | .65 |
| Median | 2 | 2 | 1 | 1 | 76 | 81 | 88 | 86 | .68 | .72 | .79 | .76 |
| Mean | 4 | 3 | 2 | 2 | 70 | 74 | 80 | 80 | .64 | .68 | .74 | .73 |
| Third quartile | 5 | 4 | 3 | 3 | 89 | 93 | 96 | 95 | .79 | .84 | .87 | .86 |
| Maximum | 86 | 95 | 88 | 19 | 99 | 99 | 99 | 99 | .96 | .96 | .97 | .96 |
*The real expenditure data have 112 observations, so the dispersion of position was rescaled to 100.
Figure 3. Parallel coordinate plot of the elements of the loading vectors. Green marks the cluster in which an increasing proportion of explained variance decreases the percentage of misclassification into quintiles. The dark line indicates the population loading vector from which the data were generated. Inset: scatter plot of the proportion of explained variance against the percentage of misclassification, color-matched to the parallel coordinate plot to show which estimated loading vectors are linked with each cluster.