Eun Jeong Min, Qi Long.
Abstract
BACKGROUND: Multiple co-inertia analysis (mCIA) is a multivariate analysis method that can assess relationships and trends in multiple datasets. Recently it has been used for integrative analysis of multiple high-dimensional -omics datasets. However, its estimated loading vectors are non-sparse, which presents challenges for identifying important features and interpreting analysis results. We propose two new mCIA methods: 1) a sparse mCIA method that produces sparse loading estimates and 2) a structured sparse mCIA method that further enables incorporation of structural information among variables such as those from functional genomics.
Keywords: -omics data; Gene network information; High-dimensional data; Integrative analysis; Multiple co-inertia analysis; Network penalty; Structural information; l0 penalty
Year: 2020 PMID: 32293260 PMCID: PMC7157996 DOI: 10.1186/s12859-020-3455-4
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Simulation designs for each scenario and corresponding true loading vectors. All true vectors are normalized to have l2-norm 1
Simulation results using sparse mCIA are shown. Sensitivity (Sens), Specificity (Spec), and Matthews correlation coefficient (MCC) for feature selection performance and Angle for estimation performance are calculated. 5-fold cross validation is used to choose the best tuning parameter combination in each method. Values within parentheses are standard errors.
| | sparse multiple CIA | | | | | | | | | | | | mCIA | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Scenario | Sens | Spec | MCC | Angle | Sens | Spec | MCC | Angle | Sens | Spec | MCC | Angle | Angle | Angle | Angle |
| 1 | 0.675 | 0.991 | 0.754 | 0.885 | 0.74 | 0.991 | 0.803 | 0.901 | 0.77 | 0.991 | 0.82 | 0.905 | 0.882 | 0.847 | 0.830 |
| | (0.285) | (0.018) | (0.161) | (0.081) | (0.205) | (0.014) | (0.102) | (0.052) | (0.155) | (0.012) | (0.071) | (0.037) | (0.025) | (0.028) | (0.025) |
| 2 | 0.754 | 0.974 | 0.781 | 0.901 | 0.759 | 0.966 | 0.762 | 0.886 | 0.755 | 0.96 | 0.743 | 0.875 | 0.879 | 0.847 | 0.833 |
| | (0.130) | (0.032) | (0.058) | (0.028) | (0.089) | (0.027) | (0.046) | (0.024) | (0.071) | (0.022) | (0.041) | (0.021) | (0.024) | (0.027) | (0.023) |
| 3 | 0.711 | 0.996 | 0.794 | 0.904 | 0.776 | 0.996 | 0.846 | 0.924 | 0.813 | 0.996 | 0.87 | 0.933 | 0.933 | 0.915 | 0.897 |
| | (0.316) | (0.012) | (0.200) | (0.095) | (0.231) | (0.009) | (0.134) | (0.066) | (0.177) | (0.007) | (0.096) | (0.047) | (0.011) | (0.011) | (0.015) |
| 4 | 0.826 | 0.982 | 0.848 | 0.937 | 0.846 | 0.981 | 0.857 | 0.936 | 0.845 | 0.977 | 0.845 | 0.928 | 0.933 | 0.915 | 0.897 |
| | (0.145) | (0.029) | (0.069) | (0.033) | (0.100) | (0.022) | (0.040) | (0.020) | (0.077) | (0.020) | (0.040) | (0.018) | (0.011) | (0.011) | (0.015) |
| 5 | 0.771 | 0.986 | 0.816 | 0.908 | 0.763 | 0.989 | 0.812 | 0.902 | 0.764 | 0.991 | 0.819 | 0.903 | 0.882 | 0.847 | 0.83 |
| | (0.162) | (0.020) | (0.079) | (0.042) | (0.159) | (0.015) | (0.077) | (0.040) | (0.157) | (0.012) | (0.074) | (0.039) | (0.024) | (0.028) | (0.025) |
| 6 | 0.812 | 0.93 | 0.757 | 0.897 | 0.783 | 0.944 | 0.742 | 0.879 | 0.767 | 0.954 | 0.738 | 0.871 | 0.883 | 0.85 | 0.825 |
| | (0.081) | (0.042) | (0.046) | (0.023) | (0.078) | (0.031) | (0.049) | (0.023) | (0.078) | (0.024) | (0.051) | (0.027) | (0.023) | (0.025) | (0.03) |
| 7 | 0.839 | 0.99 | 0.873 | 0.941 | 0.836 | 0.993 | 0.875 | 0.938 | 0.837 | 0.994 | 0.878 | 0.939 | 0.933 | 0.912 | 0.9 |
| | (0.161) | (0.017) | (0.087) | (0.043) | (0.159) | (0.013) | (0.083) | (0.041) | (0.160) | (0.010) | (0.082) | (0.042) | (0.011) | (0.014) | (0.013) |
| 8 | 0.88 | 0.959 | 0.851 | 0.942 | 0.865 | 0.968 | 0.847 | 0.933 | 0.863 | 0.975 | 0.854 | 0.933 | 0.933 | 0.913 | 0.899 |
| | (0.077) | (0.039) | (0.044) | (0.017) | (0.076) | (0.026) | (0.040) | (0.017) | (0.071) | (0.021) | (0.036) | (0.015) | (0.011) | (0.014) | (0.013) |
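The feature-selection and estimation metrics reported in these simulation tables can be illustrated with a minimal sketch. It assumes that a feature counts as "selected" when its estimated loading is nonzero, and that Angle is the absolute cosine between the true and estimated loading vectors; the source does not spell out either definition, so both are assumptions. The function names `selection_metrics` and `abs_cosine` and the toy vectors are hypothetical and are not taken from the paper.

```python
import math


def selection_metrics(true_loading, est_loading, tol=1e-12):
    """Return (sensitivity, specificity, MCC) for the nonzero pattern.

    Assumption: an entry is "selected" when its absolute value exceeds tol.
    """
    tp = fp = tn = fn = 0
    for t, e in zip(true_loading, est_loading):
        t_nz, e_nz = abs(t) > tol, abs(e) > tol
        if t_nz and e_nz:
            tp += 1
        elif e_nz:
            fp += 1
        elif t_nz:
            fn += 1
        else:
            tn += 1
    sens = tp / (tp + fn) if tp + fn else float("nan")
    spec = tn / (tn + fp) if tn + fp else float("nan")
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention used here: MCC is set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sens, spec, mcc


def abs_cosine(u, v):
    """Absolute cosine between two vectors (assumed meaning of 'Angle')."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return abs(dot) / (norm_u * norm_v)


# Toy example (hypothetical data): the true vector has l2-norm 1.
true_a = [0.6, 0.8, 0.0, 0.0, 0.0]
est_a = [0.5, 0.0, 0.1, 0.0, 0.0]
sens, spec, mcc = selection_metrics(true_a, est_a)
angle = abs_cosine(true_a, est_a)
```

Taking the absolute value of the cosine makes the measure insensitive to the sign of a loading vector, which is a common choice because loading vectors are typically identified only up to sign.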
Simulation results using structured sparse mCIA are shown. Sensitivity (Sens), Specificity (Spec), and Matthews correlation coefficient (MCC) for feature selection performance and Angle for estimation performance are calculated. 5-fold cross validation is used to choose the best tuning parameter combination in each method. Values within parentheses are standard errors.
| | structured sparse multiple CIA | | | | | | | | | | | | mCIA | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Scenario | Sens | Spec | MCC | Angle | Sens | Spec | MCC | Angle | Sens | Spec | MCC | Angle | Angle | Angle | Angle |
| 1 | 0.71 | 0.994 | 0.786 | 0.897 | 0.767 | 0.993 | 0.827 | 0.913 | 0.79 | 0.992 | 0.837 | 0.915 | 0.882 | 0.847 | 0.830 |
| | (0.284) | (0.011) | (0.166) | (0.088) | (0.204) | (0.009) | (0.106) | (0.056) | (0.154) | (0.008) | (0.073) | (0.041) | (0.025) | (0.028) | (0.025) |
| 2 | 0.79 | 0.979 | 0.814 | 0.918 | 0.787 | 0.97 | 0.789 | 0.901 | 0.774 | 0.962 | 0.761 | 0.885 | 0.879 | 0.847 | 0.833 |
| | (0.127) | (0.021) | (0.058) | (0.030) | (0.089) | (0.018) | (0.046) | (0.024) | (0.068) | (0.016) | (0.041) | (0.022) | (0.024) | (0.027) | (0.023) |
| 3 | 0.748 | 0.995 | 0.816 | 0.915 | 0.807 | 0.996 | 0.863 | 0.934 | 0.838 | 0.996 | 0.884 | 0.941 | 0.933 | 0.915 | 0.897 |
| | (0.300) | (0.010) | (0.186) | (0.092) | (0.221) | (0.008) | (0.126) | (0.064) | (0.171) | (0.006) | (0.091) | (0.047) | (0.011) | (0.011) | (0.015) |
| 4 | 0.854 | 0.987 | 0.875 | 0.947 | 0.867 | 0.984 | 0.877 | 0.945 | 0.862 | 0.979 | 0.861 | 0.937 | 0.933 | 0.915 | 0.897 |
| | (0.142) | (0.016) | (0.072) | (0.034) | (0.097) | (0.014) | (0.042) | (0.021) | (0.074) | (0.013) | (0.038) | (0.018) | (0.011) | (0.011) | (0.015) |
| 5 | 0.798 | 0.986 | 0.833 | 0.919 | 0.791 | 0.989 | 0.831 | 0.913 | 0.793 | 0.992 | 0.838 | 0.915 | 0.882 | 0.847 | 0.83 |
| | (0.162) | (0.016) | (0.075) | (0.042) | (0.162) | (0.012) | (0.076) | (0.043) | (0.160) | (0.009) | (0.073) | (0.042) | (0.024) | (0.028) | (0.025) |
| 6 | 0.83 | 0.939 | 0.781 | 0.911 | 0.803 | 0.951 | 0.768 | 0.893 | 0.785 | 0.959 | 0.76 | 0.884 | 0.883 | 0.85 | 0.825 |
| | (0.069) | (0.029) | (0.042) | (0.020) | (0.069) | (0.020) | (0.043) | (0.021) | (0.065) | (0.017) | (0.043) | (0.024) | (0.023) | (0.025) | (0.03) |
| 7 | 0.852 | 0.993 | 0.887 | 0.947 | 0.848 | 0.994 | 0.886 | 0.944 | 0.849 | 0.996 | 0.89 | 0.945 | 0.933 | 0.912 | 0.9 |
| | (0.158) | (0.011) | (0.087) | (0.044) | (0.157) | (0.008) | (0.083) | (0.043) | (0.156) | (0.006) | (0.081) | (0.043) | (0.011) | (0.014) | (0.013) |
| 8 | 0.873 | 0.968 | 0.859 | 0.945 | 0.861 | 0.975 | 0.857 | 0.938 | 0.86 | 0.981 | 0.864 | 0.937 | 0.933 | 0.913 | 0.899 |
| | (0.076) | (0.025) | (0.039) | (0.018) | (0.077) | (0.017) | (0.039) | (0.018) | (0.072) | (0.014) | (0.035) | (0.016) | (0.011) | (0.014) | (0.013) |
For each method, the first two columns show the number of nonzero elements in the first two estimated coefficient loadings for the three datasets (Affymetrix, Agilent, and protein, respectively). The next four columns contain pseudo-eigenvalues calculated using the coefficient loadings estimated from the training dataset. The last four columns give the proportions of the pseudo-eigenvalues to the sum of total eigenvalues for each dataset.
| | # of nonzeros | | Pseudo-eigenvalues | | | | % of variability explained | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| | | | test dataset | | whole dataset | | test dataset | | whole dataset | |
| Method | 1st | 2nd | 1st | 1st + 2nd | 1st | 1st + 2nd | 1st | 1st + 2nd | 1st | 1st + 2nd |
| mCIA | (491,488,94) | (491,488,94) | 36065.92 | 33447.03 | 282991.70 | 218372.50 | 0.088 | 0.169 | 0.129 | 0.229 |
| smCIA | (250,30,20) | (100,80,15) | 31161.89 | 21283.77 | 208966.30 | 157045.80 | 0.076 | 0.127 | 0.095 | 0.167 |
| ssmCIA | (300,80,15) | (400,15,30) | 34611.11 | 36793.08 | 239050.80 | 239050.80 | 0.084 | 0.173 | 0.109 | 0.218 |
Fig. 1. From top to bottom, each row shows the results from the mCIA, smCIA, and ssmCIA methods, respectively. From left to right, the columns represent the sample space, the gene space of the Affymetrix dataset, the gene space of the Agilent dataset, and the gene space of the proteomics dataset. For the three panels in the first column, the estimates of the first loading vectors are used. Different colors represent different cell lines: breast (BR), melanoma (ME), colon (CO), ovarian (OV), renal (RE), lung (LC), central nervous system (CNS, glioblastoma), and prostate (PR) cancers, and leukemia (LE). For the remaining plots, the estimates of the first two loading vectors are used. Colored and labeled points in the plots are the top 20 genes that are most distant from the origin, which are more significant than the other genes. Complete lists of the top 20 genes for each panel can be found in the supplementary materials.