Sayantan Mitra, Sriparna Saha, Mohammed Hasanuzzaman.
Abstract
In real-world applications, data sets often comprise multiple views, which provide consensus and complementary information to each other. Embedding learning is an effective strategy for nearest neighbour search and dimensionality reduction in large data sets. This paper attempts to learn a unified probability distribution of the points across different views and generates a unified embedding in a low-dimensional space that optimally preserves neighbourhood identity. The probability distributions generated for each point in each view are combined by the conflation method to create a single unified distribution. The goal is to approximate this unified distribution as closely as possible when a similar operation is performed on the embedded space. The cost function is the sum of Kullback-Leibler divergences over the samples, which leads to a simple gradient for adjusting the positions of the samples in the embedded space. The proposed methodology can generate embeddings from both complete and incomplete multi-view data sets. Finally, a multi-objective clustering technique (AMOSA) is applied to group the samples in the embedded space. The proposed methodology, Multi-view Neighbourhood Embedding (MvNE), shows an improvement of approximately 2-3% over state-of-the-art models when evaluated on 10 omics data sets.
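The two building blocks named in the abstract, conflation and the Kullback-Leibler cost, can be illustrated for a single sample. This is a minimal sketch, not the authors' implementation. It assumes discrete neighbour probabilities over the same set of candidate neighbours in every view, and it uses the standard definition of conflation as a normalised element-wise product. The function names and the example probabilities are hypothetical.

```python
import math


def conflate(distributions):
    """Conflation of discrete distributions over the same support:
    the element-wise product, renormalised to sum to 1."""
    product = [math.prod(values) for values in zip(*distributions)]
    total = sum(product)
    if total == 0:
        raise ValueError("distributions have disjoint support")
    return [p / total for p in product]


def kl_divergence(p, q, eps=1e-12):
    """KL(P || Q) = sum_j p_j * log(p_j / q_j); eps guards against log(0)."""
    return sum(
        pj * math.log((pj + eps) / (qj + eps))
        for pj, qj in zip(p, q)
        if pj > 0
    )


# Hypothetical neighbour probabilities of one sample in two views.
view_a = [0.7, 0.2, 0.1]
view_b = [0.5, 0.4, 0.1]
unified = conflate([view_a, view_b])

# Hypothetical neighbour probabilities of the same sample in the embedding.
embedded = [0.6, 0.3, 0.1]
cost = kl_divergence(unified, embedded)
```

Under these assumptions, the full cost would be the sum of `kl_divergence` over all samples, and the embedding would be adjusted to reduce it. How the per-view and embedded probabilities are computed is not specified in the abstract.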
Year: 2020 PMID: 32788601 PMCID: PMC7423957 DOI: 10.1038/s41598-020-70229-1
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Different views of the data sets are combined in the probabilistic space by the conflation method. The low-dimensional embedding is generated by approximating the combined probability distribution in the lower-dimensional space.
Figure 2. Example of the conflation technique. The red curves are the two independent distributions, the yellow curve is the probability distribution obtained by averaging the probabilities, the blue curve is the probability distribution obtained by averaging the data, and the green curve denotes the distribution obtained by the conflation technique.
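For reference, the standard definition of conflation for continuous densities, and its closed form for two Gaussian inputs, can be written as follows. The symbols $f_i$, $\mu_i$ and $\sigma_i$ are notation assumed here for illustration; they do not appear in the caption.

```latex
\[
  \&(f_1,\dots,f_n)(x)
    = \frac{\prod_{i=1}^{n} f_i(x)}
           {\int_{-\infty}^{\infty} \prod_{i=1}^{n} f_i(y)\,dy}
\]
% Two Gaussian inputs N(\mu_1, \sigma_1^2) and N(\mu_2, \sigma_2^2)
% conflate to a Gaussian N(\mu, \sigma^2) with:
\[
  \mu = \frac{\mu_1/\sigma_1^2 + \mu_2/\sigma_2^2}
             {1/\sigma_1^2 + 1/\sigma_2^2},
  \qquad
  \sigma^2 = \frac{1}{1/\sigma_1^2 + 1/\sigma_2^2}
\]
```

In the Gaussian case the conflated variance is never larger than either input variance, so the conflated curve is at least as narrow as each input, whereas averaging the probabilities yields a mixture that can be wider than both.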
Figure 3. Network structure of the stacked autoencoder. Output of the “feature” layer is the
Description of the different views of the data sets. The numbers in brackets are the numbers of selected features.
| Dataset | Gene expression features | miRNA expression features | DNA methylation features | Samples | #Clusters |
|---|---|---|---|---|---|
| | 20510 (400) | 1046 (220) | 4885 (400) | 684 | 4 |
| | 12042 (400) | 534 (110) | 5000 (400) | 274 | 4 |
| | 12043 (400) | 800 (190) | 5000 (400) | 291 | 3 |
| | 20351 (400) | 705 (170) | 5000 (400) | 221 | 4 |
| | 20531 (400) | 705 (170) | 5000 (400) | 410 | 4 |
| | 20531 (400) | 705 (170) | 5000 (400) | 344 | 4 |
| | 20531 (400) | 705 (170) | 5000 (400) | 450 | 4 |
| | 20531 (400) | 1046 (241) | 5000 (400) | 217 | 4 |
| | 20531 (400) | 705 (170) | 5000 (400) | 212 | 4 |
| | 20531 (400) | 705 (168) | 5000 (400) | 170 | 2 |
Figure 4. Change in NMI (%) with changes in k.
Figure 5. Change in NMI (%) with the changing dimension (dim) of the embedded data set.
Five up-regulated and five down-regulated gene markers for the BRCA data set.
| LumA | LumB | Her2 | Basal |
|---|---|---|---|
| CIRBP | ARL6IP1 | ERBB2 | DSC2 |
| TENC1 | PTGES3 | GGCT | YBX1 |
| KIF13B | PCNA | ACTR3 | FOXC1 |
| COL14A1 | PLEKHF2 | STARD3 | ANP32E |
| NTN4 | SFRS1 | GRB7 | PAPSS1 |
| TUBA1C | TRIM29 | GREB1 | XBP1 |
| FOXM1 | PPL | ESR1 | GATA3 |
| MYBL2 | SFRP1 | MAPT | ZNF552 |
| MKI67 | ZFP36L2 | TBC1D9 | MLPH |
| TPX2 | NDRG2 | BCL2 | FOXA1 |
The p-values obtained when comparing MvNE with the other methods in terms of NMI.
| Method | BRCA | GBM | OVG | COAD | LIHC | LUSC | SKCM | SARC | KIRC | AML |
|---|---|---|---|---|---|---|---|---|---|---|
| MCCA | 0.0014 | 0.0091 | 0.0031 | 0.0184 | 0.0151 | 0.0073 | 0.0043 | 0.0269 | 0.0302 | 0.02546 |
| MultiNMF | 0.0086 | 0.0076 | 0.0051 | 0.0303 | 0.0054 | 0.0019 | 0.0089 | 0.0214 | 0.0275 | 0.0139 |
| DiMSC | 0.0056 | 0.0034 | 0.0051 | 0.0291 | 0.0104 | 0.0078 | 0.0035 | 0.0235 | 0.0207 | 0.0201 |
| LRAcluster | 0.0036 | 0.0211 | 0.0215 | 0.0012 | 0.0036 | 0.0051 | 0.0062 | 0.0084 | 0.0051 | 0.0167 |
| PINS | 0.0044 | 0.0112 | 0.0008 | 0.0273 | 0.0089 | 0.0045 | 0.0244 | 0.0062 | 0.0065 | 0.0163 |
| SNF | 0.0126 | 0.0057 | 0.0086 | 0.0076 | 0.0277 | 0.0062 | 0.0119 | 0.00634 | 0.0026 | 0.0062 |
| iClusterBayes | 0.0045 | 0.0118 | 0.0086 | 0.0124 | 0.0357 | 0.0023 | 0.0076 | 0.0048 | 0.0034 | 0.0043 |
| MVDA | 0.0042 | 0.0132 | 0.0071 | 0.0051 | 0.0043 | 0.0073 | 0.0009 | 0.0015 | 0.0021 | 0.0077 |
| AvgProb | 0.0051 | 0.0178 | 0.0012 | 0.0012 | 0.0057 | 0.0031 | 0.0064 | 0.0068 | 0.0074 | 0.0053 |
| AvgData | 0.0061 | 0.0113 | 0.0091 | 0.0051 | 0.0049 | 0.0028 | 0.0098 | 0.0041 | 0.0092 | 0.0062 |
Comparison results in terms of NMI.
| Method | BRCA | GBM | OVG | COAD | LIHC | LUSC | SKCM | SARC | KIRC | AML |
|---|---|---|---|---|---|---|---|---|---|---|
| MvNE | ||||||||||
| MCCA | 0.2086 | 0.2865 | 0.0731 | 0.0784 | 0.0546 | 0.2031 | 0.0952 | 0.0993 | 0.0765 | 0.3046 |
| MultiNMF | 0.3001 | 0.3606 | 0.0713 | 0.0657 | 0.0489 | 0.2401 | 0.0981 | 0.0836 | 0.0755 | 0.2787 |
| DiMSC | 0.3856 | 0.4089 | 0.1015 | 0.0917 | 0.0806 | 0.2598 | 0.0996 | 0.1051 | 0.0892 | 0.3166 |
| LRAcluster | 0.0146 | 0.0532 | 0.0304 | 0.0328 | 0.0573 | 0.0672 | 0.0483 | 0.0475 | 0.0389 | 0.3629 |
| PINS | 0.0118 | 0.0153 | 0.0095 | 0.0459 | 0.0348 | 0.0237 | 0.0382 | 0.0262 | 0.0279 | 0.2219 |
| SNF | 0.3581 | 0.026 | 0.0068 | 0.0332 | 0.0129 | 0.0082 | 0.0088 | 0.0233 | 0.0908 | 0.4349 |
| iClusterBayes | 0.0121 | 0.0306 | 0.0081 | 0.0106 | 0.0258 | 0.0112 | 0.0044 | 0.0177 | 0.0108 | 0.0894 |
| MVDA | 0.3912 | 0.4213 | 0.1063 | 0.0993 | 0.0775 | 0.2694 | 0.1195 | 0.1121 | 0.0955 | 0.2871 |
| AvgProb | 0.0119 | 0.0281 | 0.0092 | 0.0151 | 0.0261 | 0.0108 | 0.0038 | 0.0107 | 0.0109 | 0.0604 |
| AvgData | 0.3804 | 0.4053 | 0.1035 | 0.09193 | 0.0705 | 0.2644 | 0.1007 | 0.0891 | 0.1008 | 0.3014 |
Comparison results in terms of ARI.
| Method | BRCA | GBM | OVG | COAD | LIHC | LUSC | SKCM | SARC | KIRC | AML |
|---|---|---|---|---|---|---|---|---|---|---|
| MvNE | ||||||||||
| MCCA | 0.1903 | 0.2243 | 0.03031 | 0.0182 | 0.0031 | 0.1012 | 0.0093 | 0.0188 | 0.0195 | 0.1846 |
| MultiNMF | 0.2107 | 0.25476 | 0.04112 | 0.0163 | 0.0041 | 0.1107 | 0.0102 | 0.0208 | 0.0203 | 0.1964 |
| DiMSC | 0.2189 | 0.2806 | 0.0503 | 0.0191 | 0.0061 | 0.1142 | 0.0113 | 0.0212 | 0.0341 | 0.2137 |
| LRAcluster | 0.0086 | 0.0076 | 0.0051 | 0.0184 | 0.0054 | 0.0098 | 0.0055 | 0.0263 | 0.0392 | 0.2546 |
| PINS | 0.0144 | 0.0089 | 0.0045 | 0.0244 | 0.0067 | 0.0065 | 0.0016 | 0.0152 | 0.0136 | 0.1195 |
| SNF | 0.0126 | 0.0027 | 0.0062 | 0.0119 | 0.0063 | 0.0026 | 0.00062 | 0.0238 | 0.0157 | 0.3667 |
| iClusterBayes | 0.0045 | 0.0357 | 0.0023 | 0.0076 | 0.0048 | 0.0034 | 0.0043 | 0.0399 | 0.0288 | 0.0482 |
| MVDA | 0.2457 | 0.3441 | 0.0614 | 0.0194 | 0.0085 | 0.1351 | 0.0145 | 0.02366 | 0.0355 | 0.2021 |
| AvgProb | 0.021 | 0.0512 | 0.0091 | 0.00651 | 0.0096 | 0.0851 | 0.0013 | 0.0209 | 0.0205 | 0.0508 |
| AvgData | 0.2107 | 0.3404 | 0.0716 | 0.1941 | 0.0091 | 0.1261 | 0.01321 | 0.01326 | 0.01201 | 0.2709 |
Macro F1-score and accuracy values obtained by MvNE for all the data sets.
| Datasets | macro F1-score | Accuracy |
|---|---|---|
| BRCA | 0.6632 | 0.6701 |
| GBM | 0.6801 | 0.6918 |
| OVG | 0.4883 | 0.4765 |
| COAD | 0.4514 | 0.4531 |
| LIHC | 0.4457 | 0.4612 |
| LUSC | 0.5856 | 0.5771 |
| SKCM | 0.5031 | 0.5118 |
| SARC | 0.5503 | 0.5517 |
| KIRC | 0.4675 | 0.4718 |
| AML | 0.6904 | 0.7013 |
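As a reminder of what the two columns measure, the sketch below computes the macro F1-score (the unweighted mean of the per-class F1-scores) and accuracy from two label lists. It is an illustration under stated assumptions, not the evaluation code behind the table: it assumes predicted cluster labels have already been mapped to class names, and the function name and toy labels are hypothetical.

```python
def macro_f1_and_accuracy(y_true, y_pred):
    """Return (macro F1, accuracy) for two equal-length label sequences."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # Per-class F1 = 2TP / (2TP + FP + FN); 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    macro_f1 = sum(f1_scores) / len(f1_scores)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return macro_f1, accuracy


# Toy labels for illustration only; these are not taken from the paper.
y_true = ["LumA", "LumA", "LumB", "Her2", "Basal", "Basal"]
y_pred = ["LumA", "LumB", "LumB", "Her2", "Basal", "LumA"]
macro_f1, accuracy = macro_f1_and_accuracy(y_true, y_pred)
```

Because the macro average weights every class equally, it can differ noticeably from accuracy when class sizes are unbalanced.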
Figure 6. Heatmap showing the levels of expression of selected gene markers in the BRCA data set for each subclass.
Figure 7. Gene expression profile plot in the BRCA data set for each subclass.
Figure 8. Error plot for low dimension generation.