Omri Har-Shemesh, Rick Quax, J Stephen Lansing, Peter M A Sloot.
Abstract
The analysis of questionnaires often involves representing the high-dimensional responses in a low-dimensional space (e.g., PCA, MCA, or t-SNE). However, questionnaire data often contain categorical variables, and common statistical model assumptions rarely hold. Here we present a non-parametric approach based on Fisher Information which obtains a low-dimensional embedding of a statistical manifold (SM). The SM has deep connections with parametric statistical models and the theory of phase transitions in statistical physics. First, we simulate questionnaire responses based on a non-linear SM and validate our method against other methods. Second, we apply our method to two empirical datasets containing largely categorical variables: an anthropological survey of rice farmers in Bali and a cohort study on health inequality in Amsterdam. Compared with previous analyses and existing anthropological knowledge, we conclude that our method best discriminates between different behaviours, paving the way to dimension reduction as effective as for continuous data.
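To make the idea of embedding a statistical manifold more concrete, here is a minimal sketch. It is not the algorithm of the paper: it assumes each group of respondents is summarised by a single categorical distribution, uses the closed-form Fisher-Rao distance between categorical distributions, and swaps in classical multidimensional scaling for the embedding step. The toy data and the names `fisher_rao_distance`, `classical_mds` and `group_dists` are hypothetical.

```python
import numpy as np


def fisher_rao_distance(p, q):
    """Fisher-Rao geodesic distance between two categorical distributions."""
    bc = np.sum(np.sqrt(p * q))  # Bhattacharyya coefficient
    return 2.0 * np.arccos(np.clip(bc, 0.0, 1.0))


def classical_mds(dist, n_components=2):
    """Classical MDS: embed points so Euclidean distances approximate `dist`."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering  # double-centred Gram matrix
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_components]   # largest eigenvalues first
    scale = np.sqrt(np.clip(eigvals[order], 0.0, None))
    return eigvecs[:, order] * scale


# Hypothetical toy data: rows are groups, columns are answer frequencies
# for one categorical question (each row sums to 1).
group_dists = np.array([
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.1, 0.2, 0.7],
])

n_groups = len(group_dists)
dist = np.array([
    [fisher_rao_distance(group_dists[i], group_dists[j]) for j in range(n_groups)]
    for i in range(n_groups)
])
embedding = classical_mds(dist, n_components=2)
print(embedding)
```

In this toy example the first two groups answer similarly, so they should land close together in the embedding, while the third group sits further away. Real questionnaires have many questions and finite samples per group, which this sketch ignores.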
Year: 2020 PMID: 32451420 PMCID: PMC7248094 DOI: 10.1038/s41598-020-63760-8
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. The proposed algorithm written in pseudo-code. The individual steps are explained in the main text.
Figure 2. An example of the curve Eq. (5) in three dimensions with values of k = 1, 2 and m = 1, 2.
Figure 3. Low-dimensional embeddings of the curves with k = 1, 2, N−1 for K = 20 and K = 50 groups. For all simulations m = 3, N = 8 and N = 3. The simulations with K = 20 had 25 responses per group and the simulations with K = 50 had 50 responses per group. Colours represent the multipartite information for each distribution along the line.
Figure 4. Comparison of MCA and FI in the analysis of the data collected on the Subaks in Bali. The Subaks that were previously identified as belonging to different regimes are highlighted, in addition to the newly discovered outliers “Pakudui” and “Subak Dukuh, Kapal”. The colour scale indicates the elevation of the Subaks, which reflects their dependence on other Subaks to provide water.
Figure 5. Low-dimensional embedding of the responses, divided into ethnic groups and smoker status. Smoker status is designated by square, triangle and circle, which stand for “never smoked”, “quit smoking” and “currently smoking” respectively. The Ghanaian group is highlighted by an ellipse on the left plot.