Inci M Baytas, Kaixiang Lin, Fei Wang, Anil K Jain, Jiayu Zhou.
Abstract
Principal component analysis (PCA) is a dimensionality reduction and data analysis tool commonly used in many areas. The main idea of PCA is to represent high-dimensional data with a few representative components that capture most of the variance present in the data. However, traditional PCA has an obvious disadvantage when it is applied to data where interpretability is important. In applications where the features have physical meaning, we lose the ability to interpret the principal components extracted by conventional PCA because each principal component is a linear combination of all the original features. For this reason, sparse PCA has been proposed to improve the interpretability of traditional PCA by introducing sparsity to the loading vectors of the principal components. Sparse PCA can be formulated as an ℓ1-regularized optimization problem, which can be solved by proximal gradient methods. However, these methods do not scale well because computation of the exact gradient is generally required at each iteration. The stochastic gradient framework addresses this challenge by using, at each iteration, a stochastic gradient whose expectation is the exact gradient. Nevertheless, stochastic approaches typically have slow convergence rates due to the high variance of these gradient estimates. In this paper, we propose a convex sparse principal component analysis (Cvx-SPCA), which leverages a proximal variance-reduced stochastic scheme to achieve a geometric convergence rate. We further show that the convergence analysis can be significantly simplified by using a weak condition which allows a broader class of objectives to be applied. The efficiency and effectiveness of the proposed method are demonstrated on a large-scale electronic medical record cohort.
Keywords: Convex PCA; Proximal mapping; Sparse PCA
Year: 2016 PMID: 27660635 PMCID: PMC5018037 DOI: 10.1186/s13637-016-0045-x
Source DB: PubMed Journal: EURASIP J Bioinform Syst Biol ISSN: 1687-4145
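The proximal mapping named in the keywords can be illustrated for the ℓ1 penalty, where it reduces to entrywise soft-thresholding. The following is a minimal sketch, not the authors' implementation: the function names `soft_threshold` and `prox_grad_step` and the parameters `step` and `lam` are hypothetical, and the gradient of the smooth part is simply passed in rather than derived from the Cvx-SPCA objective, which is not given in this record.

```python
from typing import List


def soft_threshold(v: List[float], tau: float) -> List[float]:
    """Proximal mapping of tau * ||.||_1: shrink each entry toward zero by tau."""
    out = []
    for x in v:
        if x > tau:
            out.append(x - tau)
        elif x < -tau:
            out.append(x + tau)
        else:
            out.append(0.0)  # entries within [-tau, tau] become exactly zero
    return out


def prox_grad_step(w: List[float], grad: List[float],
                   step: float, lam: float) -> List[float]:
    """One proximal gradient step for smooth loss + lam * ||w||_1.

    `grad` is the gradient of the smooth part at w (supplied by the caller).
    """
    moved = [wi - step * gi for wi, gi in zip(w, grad)]
    return soft_threshold(moved, step * lam)
```

The thresholding is what produces exact zeros in the loading vector, which is the source of the sparsity discussed in the abstract.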
Fig. 1 Convergence for synthetic data. Convergence of the proposed stochastic Cvx-SPCA with variance reduction (Prox-SVRG) and without variance reduction (Prox-SGD). The proximal stochastic gradient method with variance reduction converges faster, since the variance caused by random sampling is bounded in Prox-SVRG.
Fig. 2 Convergence of the sparsity pattern in the log scale. Cvx-SPCA with Prox-SGD takes 275 iterations, whereas Cvx-SPCA with Prox-SVRG takes 45 iterations, to converge to a similar sparsity pattern.
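The difference between the two schemes compared in Fig. 1 can be sketched on a toy problem. This is an illustration under stated assumptions, not the authors' code: the smooth loss is an ℓ1-regularized least-squares stand-in rather than the Cvx-SPCA objective, and every name (`prox_sgd`, `prox_svrg`, `make_toy_data`, `step`, `lam`, `inner`) is hypothetical. Prox-SVRG follows the usual pattern of a full gradient at a snapshot point plus a per-sample correction.

```python
import random


def soft_threshold(v, tau):
    """Proximal mapping of tau * ||.||_1, applied entrywise."""
    return [x - tau if x > tau else x + tau if x < -tau else 0.0 for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def grad_i(w, x, y):
    """Gradient of the toy per-sample loss 0.5 * (x.w - y)^2 (an assumption)."""
    r = dot(x, w) - y
    return [r * a for a in x]


def objective(w, X, Y, lam):
    """Average toy loss plus the l1 penalty."""
    loss = sum(0.5 * (dot(x, w) - y) ** 2 for x, y in zip(X, Y)) / len(X)
    return loss + lam * sum(abs(a) for a in w)


def prox_step(w, direction, step, lam):
    moved = [a - step * g for a, g in zip(w, direction)]
    return soft_threshold(moved, step * lam)


def prox_sgd(X, Y, lam, step, iters, rng):
    """Proximal SGD: one sampled gradient per step, no variance reduction."""
    w = [0.0] * len(X[0])
    for _ in range(iters):
        i = rng.randrange(len(X))
        w = prox_step(w, grad_i(w, X[i], Y[i]), step, lam)
    return w


def prox_svrg(X, Y, lam, step, epochs, inner, rng):
    """Proximal SVRG: sampled gradient corrected by a snapshot full gradient."""
    n, d = len(X), len(X[0])
    snap = [0.0] * d
    for _ in range(epochs):
        # Full gradient at the snapshot, computed once per epoch.
        full = [0.0] * d
        for x, y in zip(X, Y):
            full = [f + g / n for f, g in zip(full, grad_i(snap, x, y))]
        w = list(snap)
        for _ in range(inner):
            i = rng.randrange(n)
            g_w = grad_i(w, X[i], Y[i])
            g_s = grad_i(snap, X[i], Y[i])
            # Variance-reduced direction: unbiased estimate of the gradient.
            v = [a - b + f for a, b, f in zip(g_w, g_s, full)]
            w = prox_step(w, v, step, lam)
        snap = w  # last inner iterate as the next snapshot (one common option)
    return snap


def make_toy_data(n, rng):
    """Noiseless toy regression data with a sparse ground-truth vector."""
    w_true = [1.0, 0.0, 0.0, -2.0, 0.0]
    X = [[rng.gauss(0.0, 1.0) for _ in w_true] for _ in range(n)]
    Y = [dot(x, w_true) for x in X]
    return X, Y
```

A possible usage is to draw `X, Y = make_toy_data(60, random.Random(0))`, run both solvers with the same `step` and `lam`, and compare `objective(...)` per epoch; the step size has to be chosen small relative to the largest squared sample norm for either method to be stable.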
Running times (in seconds) of different SPCA algorithms
| Sample size | Cvx-SPCA | [ | [ | [ |
|---|---|---|---|---|
| | 20.9 | 207.1 | 48.7 | 3002 |
| | 26.2 | 466.9 | 78.3 | 3237.4 |
| | 35.6 | 2737.06 | 2661.7 | 5276.93 |
| | 35.8 | 3408.59 | 3568 | 5274.26 |
Since the proposed Cvx-SPCA does not depend on eigenvalue decomposition or semi-definite programming, it is more scalable in terms of the sample size. It also requires fewer iterations to reach a desired sparsity.
Fig. 3 Regularization path for Cvx-SPCA. To confirm that this is a valid regularization path, we checked whether the known principal component can be recovered along the path. When the regularization term was around −0.11 (dashed line) in logarithmic scale, we could exactly recover the non-zero loading values of the known principal component that was used to generate the data.
Fig. 4 Patient distribution of demographic groups. While sub-sampling the patients, we used only diagnoses/diseases that carry explicit information about the demographic group of the patient. Each group of patients shows a similar trend: most of the patients have 1–50 diagnoses entered into the record.
Fig. 5 Patient distribution. We observe that the majority of the patients have very few records.
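The check described for Fig. 3, sweeping the regularization value and asking where the known sparsity pattern is recovered, can be sketched on a toy problem. This is an assumption-laden illustration, not the Cvx-SPCA path: it uses the ℓ1-penalized denoising problem min_w 0.5·||w − v||² + lam·||w||₁, whose solution is soft-thresholding of `v`, and the names `regularization_path` and `recovering_lams` as well as the example numbers are hypothetical.

```python
def soft_threshold(v, tau):
    """Closed-form solution of min_w 0.5*||w - v||^2 + tau*||w||_1."""
    return [x - tau if x > tau else x + tau if x < -tau else 0.0 for x in v]


def support(w):
    """Indices of the non-zero entries of w."""
    return [i for i, x in enumerate(w) if x != 0.0]


def regularization_path(v, lams):
    """Pairs (lam, support of the solution) for each regularization value."""
    return [(lam, support(soft_threshold(v, lam))) for lam in lams]


def recovering_lams(v, true_support, lams):
    """Regularization values at which the known support is recovered exactly."""
    return [lam for lam, s in regularization_path(v, lams) if s == true_support]


# Toy noisy observation of a vector whose true support is {0, 2, 4}.
v = [0.9, 0.05, -1.2, -0.03, 0.6]
lams = [0.01, 0.1, 0.3, 0.7, 1.0]
print(recovering_lams(v, [0, 2, 4], lams))
```

Too small a value keeps the noise entries, and too large a value removes true entries, so exact recovery happens only in an intermediate window of the path.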
We sampled patients who have female-, male-, child-, and old-age-related features. These samples may overlap with each other. For instance, a patient may have both dementia and a prostate problem. We did not include in these groups other problems, such as hypertension or kidney problems, which can be encountered at every age and in both genders.
| Demographic | Number of features | Number of patients |
|---|---|---|
| Female | 1268 | 130,035 |
| Male | 106 | 24,184 |
| Old | 66 | 2060 |
| Child | 596 | 38,434 |
Fig. 6 Convergence over 20 epochs of Cvx-SPCA for different numbers of patients.
EMR data features which contribute to the output dimensions after the Cvx-SPCA algorithm was applied to the whole patient population. The most frequently observed problems are infections, injuries, pregnancy- and delivery-related problems, and cancer types.
| ICD9 code | Description |
|---|---|
| 7 | Balantidiasis/infectious |
| 72 | Mumps orchitis/infectious |
| 115 | Infection by histoplasma capsulatum |
| 266 | Ariboflavinosis/metabolic disorder |
| 507 | Pneumonitis/bacterial |
| 695 | Toxic erythema/dermatological |
| 697 | Lichen planus/dermatological |
| 761 | Incompetent cervix affecting fetus or newborn |
| 795 | Abnormal glandular papanicolaou smear of cervix |
| 924 | Contusion of thigh/injury |
EMR data features which contribute to the output dimensions after applying the proposed algorithm to the subset of patients who have female-related problems. We could observe female-specific problems and other common diseases such as heart problems and anemia.
| ICD9 code | Description |
|---|---|
| 281 | Pernicious anemia |
| 392 | Valvular and rheumatic heart disease |
| 614 | Female genital disorders |
| 778 | Serious perinatal problem affecting newborn |
| 905 | Major head injury |
EMR data features which contribute to the output dimensions after applying the proposed algorithm to the subset of patients who have male-related problems. We could observe a prostate problem, which is directly related to male patients. In addition, we can also see other common problems such as injuries.
| ICD9 code | Description |
|---|---|
| 185 | Malignant neoplasm of prostate |
| 298 | Depressive type psychosis |
| 719 | Effusion of joint |
| 800 | Closed fracture of vault of skull |
| 811 | Closed fracture of scapula |
| 860 | Traumatic pneumothorax |
EMR data features which contribute to the output dimensions after applying the proposed algorithm to the subset of patients who have old-age-related problems. Cancer is commonly encountered at nearly every age. In addition, we could observe disorders of the nervous system and visual problems in the results.
| ICD9 code | Description |
|---|---|
| 153 | Malignant neoplasm of colon |
| 173 | Other malignant neoplasm of skin |
| 337 | Disorders of the autonomic nervous system |
| 368 | Visual disturbance |
EMR data features which contribute to the output dimensions after applying the proposed algorithm to the subset of patients who have child-related problems. According to our observation, tuberculosis and bacterial infections are quite common among children. Unfortunately, leukemia is also a cancer type that is seen even in young children.
| ICD9 code | Description |
|---|---|
| 8 | Intestinal infection due to other organisms |
| 11 | Pulmonary tuberculosis |
| 78 | Other diseases due to viruses and Chlamydiae |
| 10 | Primary tuberculous infection |
| 204 | Lymphoid leukemia |